
I have a text file of more than 200 MB. I want to read it and then select the 30 most frequently used words. When I run my code, it gives me an error. The code is as under:

    import sys, string
    import codecs
    import re
    from collections import Counter
    import collections
    import unicodedata

    with open('E:\\Book\\1800.txt', "r", encoding='utf-8') as File_1800:
        for line in File_1800:
            sepFile_1800 = line.lower()
            words_1800 = re.findall('\w+', sepFile_1800)
    for wrd_1800 in [words_1800]:
        long_1800 = [w for w in wrd_1800 if len(w) > 3]
        common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)


    Traceback (most recent call last):
      File "C:\Python34\CommonWords.py", line 14, in <module>
        for line in File_1800:
      File "C:\Python34\lib\codecs.py", line 313, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3784: invalid start byte
    Not sure if it is the same in your actual code, but your indentation is off. Commented Sep 17, 2015 at 6:13

2 Answers


The file does not contain UTF-8 encoded data. Find the correct encoding and update the line: `with open('E:\\Book\\1800.txt', "r", encoding='correct_encoding')`
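One way to narrow down the correct encoding programmatically (a rough sketch; the candidate list and the `guess_encoding` helper are illustrative, not a standard API) is to read a sample of the raw bytes and see which common encodings decode it without error:

```python
# Try a few common encodings on a sample of the file and report
# which ones decode cleanly. Note that latin-1 accepts any byte,
# so a successful decode does not prove the text is meaningful.
CANDIDATES = ["utf-8", "cp1252", "latin-1"]

def guess_encoding(path, sample_size=1_000_000):
    with open(path, "rb") as f:      # open in binary mode: no decoding yet
        sample = f.read(sample_size)
    ok = []
    for enc in CANDIDATES:
        try:
            sample.decode(enc)
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok
```

For a more reliable guess on real-world data, a third-party detector such as `chardet` can also be used.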


5 Comments

Can you tell me how to find the correct encoding? Actually, I am new to Python.
You can use the Notepad++ editor to determine the encoding. It usually gets it right, although it's not 100% reliable. You can also try some popular options like `ISO-8859-1`.
I tried "ISO-8859-1" and it gives me this result: `{'subscribe': 1, 'email': 1, 'ebooks': 1, 'newsletter': 1, 'hear': 1, 'about': 1}`. This file contains more than 90000000 words. I also tried Notepad++: I opened the file, clicked on "Encoding", and it shows "Encoded in ANSI".
Well, the "Encoded in ANSI" part suggests it is in an ANSI encoded format, also referred to as Windows-1252 or `cp1252` (which you can try using).

Try `encoding='latin1'` instead of `utf-8`.

Also, in these lines:

for line in File_1800:
    sepFile_1800 = line.lower()
    words_1800 = re.findall('\w+', sepFile_1800)
for wrd_1800 in [words_1800]:
    ...

The script is re-assigning the matches of `re.findall` to the `words_1800` variable on every line. So by the time you get to `for wrd_1800 in [words_1800]`, the `words_1800` variable only holds the matches from the very last line.

If you want to make minimal changes, initialize an empty list before iterating through the file:

words_1800 = []

And then add the matches for each line to the list, rather than replacing the list:

words_1800.extend(re.findall('\w+', sepFile_1800))

Then you can do (without the second for loop):

long_1800 = [w for w in words_1800 if len(w) > 3]
common_words_1800 = dict(Counter(long_1800).most_common(30))
print(common_words_1800)
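Putting those pieces together, the whole approach might look like this sketch (the counting logic is wrapped in a hypothetical `top_words` helper for clarity; the file path and `latin1` encoding are taken from this thread):

```python
import re
from collections import Counter

def top_words(lines, n=30, min_len=4):
    """Collect words across all lines, then return the n most common
    words of at least min_len characters as a dict."""
    words = []
    for line in lines:
        words.extend(re.findall(r'\w+', line.lower()))
    long_words = [w for w in words if len(w) >= min_len]
    return dict(Counter(long_words).most_common(n))

# Usage with the file from the question:
# with open('E:\\Book\\1800.txt', 'r', encoding='latin1') as f:
#     print(top_words(f))
```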

4 Comments

The result is `{'ebooks': 1, 'hear': 1, 'subscribe': 1, 'email': 1, 'newsletter': 1, 'about': 1}`. This file contains more than 90000000 words.
Oh I just meant to fix the UnicodeDecodeError - I updated the answer with some comments on your code.
Thanks, it worked but not fully. For a 60 MB file it worked, but for the other file (300 MB) it gives me an error: "Traceback (most recent call last): File "C:\Python34\CommonWords.py", line 17, in <module> words_1800.extend(re.findall('\w+', sepFile_1800)) MemoryError"
There are a few changes you can make for your code to be more efficient. You could post a new question, since that's a different topic.
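One such change (a sketch of the idea hinted at here, not code from the thread; the `top_words_streaming` name is made up for illustration): update a `Counter` line by line instead of accumulating every word in a list, so memory use grows with the size of the vocabulary rather than the total word count.

```python
import re
from collections import Counter

def top_words_streaming(lines, n=30):
    """Count qualifying words one line at a time; only the Counter of
    distinct words is kept in memory, never the full word list."""
    counts = Counter()
    for line in lines:
        counts.update(
            w for w in re.findall(r'\w+', line.lower()) if len(w) > 3
        )
    return dict(counts.most_common(n))

# Usage:
# with open('E:\\Book\\1800.txt', 'r', encoding='latin1') as f:
#     print(top_words_streaming(f))
```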
