Getting UnicodeDecodeError while reading Excel in Tornado, Python

Question

I'm using Postman to send an Excel file, which I am reading in Tornado.

Tornado code

self.request.files['1'][0]['body'].decode()

If I send a .csv file, the above code works.

If I send an .xlsx file, I am stuck with this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte

request.files fetches the file, but the body is of type bytes. To convert bytes to str I've used decode(), which works only for .csv and not for .xlsx.

I tried decode('utf-8'), but still no luck.

I've tried searching but didn't find any issue mentioning the 0x87 problem.

xyres · Accepted Answer · 2018-06-06 10:48:00Z

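The reason is that the file's bytes are not valid utf-8. For a text file this means it was saved in a different encoding, and you'll need to use that original encoding to decode it. (An .xlsx file is not text at all: it is a ZIP archive of XML parts, so it needs a spreadsheet library rather than a codec, as the comments below work out.)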

There's no guaranteed way of finding out the encoding of a file programmatically. I'm guessing you're making this application for general users and so you will keep encountering files with different and unexpected encodings.

A good way to deal with this is by trying to decode using multiple encodings, in case one fails. Example:

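# note: 'iso-8859-1' maps every byte to a character, so it never raises
# UnicodeDecodeError and the encodings listed after it are never tried
encodings = ['utf-8', 'iso-8859-1', 'windows-1251', 'windows-1252']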

for encoding in encodings:
    try:
        decoded_file = self.request.files['1'][0]['body'].decode(encoding)
    except UnicodeDecodeError:
        # this will run when the current encoding fails
        # just ignore the error and try the next one
        pass
    else:
        # this will run when an encoding passes
        # it is also a good idea to re-encode the
        # decoded text to utf-8 for your purpose
        # (note: this turns the str back into bytes)
        decoded_file = decoded_file.encode("utf8")
        # break the loop
        break
else:
    # this will run when the for loop ends
    # without successfully decoding the file
    # now you can return an error message
    # to the user asking them to change
    # the file encoding and re-upload
    self.write("Error: Unidentified file encoding. Re-upload with UTF-8 encoding")
    return

# when the program reaches here, it means 
# you have successfully decoded the file 
# and you can access it from `decoded_file` variable

Here's a list of some common encodings: What is the most common encoding of each language?
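
The fallback loop above can also be packaged as a small standalone function. This is only a sketch: the name `decode_with_fallback` and the default order of encodings are assumptions for illustration, not part of Tornado or of the answer. Note that 'iso-8859-1' accepts any byte sequence, so here it is placed last as a catch-all, and that this only makes sense for text uploads such as .csv.

```python
from typing import Optional, Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "windows-1252", "iso-8859-1"),
) -> Optional[Tuple[str, str]]:
    """Return (text, encoding) for the first encoding that decodes data, else None.

    Hypothetical helper: the name and the default encoding order are assumptions.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    return None


# Valid UTF-8 is accepted by the first candidate.
print(decode_with_fallback("héllo".encode("utf-8")))
# Windows-1252 curly quotes are not valid UTF-8, so the second candidate is used.
print(decode_with_fallback(b"\x93hi\x94", ("utf-8", "windows-1252")))
# Byte 0x81 is undefined in Python's windows-1252 codec, so nothing matches.
print(decode_with_fallback(b"\x81", ("utf-8", "windows-1252")))
```

Returning the encoding alongside the text lets the caller log which candidate matched. A successful decode only means the bytes were valid in that encoding, not that the guess was right.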

Thanks for the idea of multiple encodings. This .xlsx decodes with 'iso-8859-1', but when I encode with utf-8 the output is just some random bytes which I don't understand. I've tried different encodings but still no luck. Can you please try decoding this file? test.xlsx
@MukeshSuthar You're reading the data as a plain string (or bytes). You need to treat the data as xlsx format. But that would be a lot of code to write. Use a library called openpyxl, which will help you read the data as xlsx.
@MukeshSuthar One more thing, if you want to read the data directly from memory without saving to disk first, see this answer.
Is conversion to CSV an option (maybe using a Python-triggered VBA binary)?
@xyres EQObject = self.request.files['0'][0]['body'].decode('iso-8859-1').encode('utf-8'); openpyxl.load_workbook(filename=BytesIO(EQObject)). Error: zipfile.BadZipFile: Bad magic number for central directory. I don't understand the zip error: I'm getting the input as bytes, which is xlsx, so how does this zip thing in between come to raise such ambiguous errors?
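
A possible explanation for the zip error in the last comment: an .xlsx file is a ZIP archive, and decoding its bytes as 'iso-8859-1' and re-encoding them as 'utf-8' turns every byte at or above 0x80 into two bytes, which shifts the offsets the ZIP reader relies on. The sketch below reproduces this with the standard library only. The tiny archive is a stand-in for a real upload, and `can_open_as_zip` is a hypothetical helper name.

```python
import io
import zipfile

# Build a tiny ZIP archive in memory as a stand-in for an uploaded .xlsx body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
original = buffer.getvalue()

# The round trip from the comment above: each byte >= 0x80 becomes two bytes.
mangled = original.decode("iso-8859-1").encode("utf-8")


def can_open_as_zip(data: bytes) -> bool:
    """Return True if zipfile can read the archive's file listing."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            archive.namelist()
        return True
    except zipfile.BadZipFile:
        return False


print(original[:2])                  # ZIP signature bytes
print(len(original), len(mangled))   # the mangled copy is longer
print(can_open_as_zip(original), can_open_as_zip(mangled))
```

The practical consequence is to pass the untouched `body` bytes to the spreadsheet library, wrapped in `io.BytesIO`, without any decode or encode step.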

Shantanu Verma · Accepted Answer · 2021-10-14 09:35:55Z

1

I faced the same issue and this worked for me.

    import io

    import pandas as pd

    df = pd.read_excel(io.BytesIO(self.request.files['1'][0]['body']))
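
Since the same endpoint receives both .csv and .xlsx uploads, it may help to look at the raw bytes before choosing how to read them. The following is a minimal sketch using only the standard library. The function name `upload_kind` is hypothetical, and it assumes the usual .xlsx layout in which the archive contains a part named xl/workbook.xml.

```python
import io
import zipfile


def upload_kind(body: bytes) -> str:
    """Classify an uploaded body as 'xlsx', 'text' or 'unknown' (hypothetical helper)."""
    # .xlsx files are ZIP archives, which start with the bytes "PK\x03\x04".
    if body[:4] == b"PK\x03\x04":
        try:
            with zipfile.ZipFile(io.BytesIO(body)) as archive:
                # Assumption: a workbook normally stores this part.
                if "xl/workbook.xml" in archive.namelist():
                    return "xlsx"
        except zipfile.BadZipFile:
            pass
        return "unknown"
    # Anything else is treated as text only if it is valid UTF-8.
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "text"


print(upload_kind(b"a,b\n1,2\n"))
```

In a handler, an 'xlsx' result would go to `pd.read_excel(io.BytesIO(body))` as in the answer above, and a 'text' result to `body.decode()`.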


1 Comment

Marinario Agalliu Over a year ago

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

sudonym · Accepted Answer · 2018-06-06 08:23:31Z

0

try this one, following suggestions provided here:

self.request.files['1'][0]['body'].decode('iso-8859-1').encode('utf-8')

edited Jun 6, 2018 at 8:23

answered Jun 6, 2018 at 6:55


6 Comments

Mukesh Suthar Over a year ago

AttributeError: 'bytes' object has no attribute 'str'

sudonym Over a year ago

try it after removing the first "str" - I have updated my answer accordingly. if this doesn't work, try to remove the second "str" while keeping the first one

Mukesh Suthar Over a year ago

Still the same error: AttributeError: 'bytes' object has no attribute 'str'. Can you tell me why we need to encode it after decoding?

sudonym Over a year ago

just remove all the 'str' - decode/encode is a common conversion flow

Mukesh Suthar Over a year ago

Removed all 'str'; the code works, but the output is just some random byte-formatted text. Output: \x00\x11\x00\x07\x04\x00\x00\x18K\x00\x00\x00\x00' ... and so on.
