I'm using Postman to send an Excel file, which I then read in Tornado.


Tornado code

self.request.files['1'][0]['body'].decode()

If I send a .csv file, the above code works.


If I send an .xlsx file, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte


request.files returns the file, but its body is of type bytes. To convert bytes to str I used decode(), which works only for .csv and not for .xlsx.

I tried decode('utf-8'), but still no luck.

I've searched but didn't find any issue mentioning the 0x87 problem.

3 Answers


The reason is that the .xlsx file has a different encoding, not utf-8. You'll need to use the original encoding to decode the file.

There's no guaranteed way of finding out the encoding of a file programmatically. I'm guessing you're making this application for general users and so you will keep encountering files with different and unexpected encodings.

A good way to deal with this is by trying to decode using multiple encodings, in case one fails. Example:

encodings = ['utf-8', 'iso-8859-1', 'windows-1251', 'windows-1252']

for encoding in encodings:
    try:
        decoded_file = self.request.files['1'][0]['body'].decode(encoding)
    except UnicodeDecodeError:
        # this will run when the current encoding fails
        # just ignore the error and try the next one
        pass
    else:
        # this will run when an encoding passes
        # break the loop
        # it is also a good idea to re-encode the 
        # decoded files to utf-8 for your purpose
        decoded_file = decoded_file.encode("utf8")
        break
else:
    # this will run when the for loop ends
    # without successfully decoding the file
    # now you can return an error message
    # to the user asking them to change 
    # the file encoding and re upload
    self.write("Error: Unidentified file encoding. Re-upload with UTF-8 encoding")
    return

# when the program reaches here, it means 
# you have successfully decoded the file 
# and you can access it from `decoded_file` variable

Here's a list of some common encodings: What is the most common encoding of each language?


8 Comments

Thanks for the idea of trying multiple encodings. This .xlsx decodes with 'iso-8859-1', but when I encode it as utf-8 the output is just random bytes that I don't understand. I've tried different encodings but still no luck. Can you please try decoding this file? test.xlsx
@MukeshSuthar You're reading the data as a plain string (or bytes). You need to treat the data as xlsx format. But that would be a lot of code to write. Use a library called openpyxl, which will help you read the data as xlsx.
@MukeshSuthar One more thing: if you want to read the data directly from memory without saving to disk first, see this answer.
Is conversion to CSV an option (maybe using a Python-triggered VBA binary)?
@xyres EQObject = self.request.files['0'][0]['body'].decode('iso-8859-1').encode('utf-8'); openpyxl.load_workbook(filename=BytesIO(EQObject)). Error: zipfile.BadZipFile: Bad magic number for central directory. I don't understand the zip error: I'm getting the input as bytes, which is xlsx, so why is zip getting in between and throwing ambiguous errors?
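On the BadZipFile error in the last comment: an .xlsx file is a ZIP container, so openpyxl opens it through zipfile. The following is only a sketch of what probably happens, using a tiny in-memory ZIP as a stand-in for the real upload (the member name, the fixed timestamp and the `can_open` helper are made up for illustration). Decoding as iso-8859-1 and re-encoding as utf-8 lengthens the data, so the offsets stored inside the archive no longer point at the right places.

```python
import io
import zipfile

# Build a small ZIP in memory as a stand-in for the uploaded workbook.
# The fixed timestamp only keeps the example deterministic.
info = zipfile.ZipInfo("xl/workbook.xml", date_time=(2020, 6, 15, 12, 30, 0))
out = io.BytesIO()
with zipfile.ZipFile(out, "w") as archive:
    archive.writestr(info, "<workbook/>")
raw = out.getvalue()

# The round trip from the comment: every byte >= 0x80 becomes two bytes.
mangled = raw.decode("iso-8859-1").encode("utf-8")


def can_open(data):
    """Return True if `data` opens and verifies as a ZIP archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


print(len(raw), len(mangled))             # the mangled copy is longer
print(can_open(raw), can_open(mangled))   # only the untouched bytes open
```

Passing the untouched bytes, i.e. BytesIO(self.request.files['0'][0]['body']), avoids the problem, because nothing rewrites the archive on the way in.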

I faced the same issue and this worked for me.

    import io
    import pandas as pd

    # The uploaded body is bytes; wrap it in a file-like object, no decode()
    df = pd.read_excel(io.BytesIO(self.request.files['1'][0]['body']))
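As supporting detail, here is a minimal standard-library sketch of why wrapping the bytes works: an .xlsx file is a ZIP container of XML parts, so it has to be handed over as binary data rather than decoded to text. The helper names and the fake upload below are hypothetical; in the handler, `body` would be self.request.files['1'][0]['body'].

```python
import io
import zipfile


def list_workbook_parts(body):
    """Return the member names of an uploaded .xlsx, read from memory."""
    buffer = io.BytesIO(body)  # file-like view of the raw bytes
    if not zipfile.is_zipfile(buffer):
        raise ValueError("upload is not a ZIP container, so it is not .xlsx")
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return archive.namelist()


def make_fake_upload():
    """Build a tiny ZIP that stands in for a real upload (made-up data)."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        archive.writestr("xl/workbook.xml", "<workbook/>")
    return out.getvalue()


parts = list_workbook_parts(make_fake_upload())
print(parts)
```

pd.read_excel and openpyxl.load_workbook both accept the same kind of BytesIO object, which is why no call to decode() is needed. A check like this could also be used to reject a .csv upload before trying to parse it as a workbook.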

1 Comment

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

Try this, following the suggestions provided here:

self.request.files['1'][0]['body'].decode('iso-8859-1').encode('utf-8')

6 Comments

AttributeError: 'bytes' object has no attribute 'str'
Try it after removing the first "str". I have updated my answer accordingly. If this doesn't work, try removing the second "str" while keeping the first one.
Still the same error: AttributeError: 'bytes' object has no attribute 'str'. Can you tell me why we need to encode it after decoding?
Just remove all the 'str'. decode/encode is a common conversion flow.
I removed all the 'str' and the code works, but the output is just random byte-formatted text. Output: \x00\x11\x00\x07\x04\x00\x00\x18K\x00\x00\x00\x00' ... and so on.
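On the "random bytes" output: a small sketch of what the decode/encode round trip does to binary data, using five made-up bytes in place of the real upload. The first four are the usual ZIP signature and the last is the 0x87 from the original error message.

```python
raw = b"PK\x03\x04\x87"  # made-up sample, not the real file

# utf-8 rejects 0x87 because it can only appear as a continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# iso-8859-1 maps every byte to a code point, so decoding never fails ...
text = raw.decode("iso-8859-1")

# ... but encoding that text as utf-8 turns 0x87 into two bytes.
reencoded = text.encode("utf-8")
print(reencoded)  # b'PK\x03\x04\xc2\x87'

# Only encoding with the same codec gives the original bytes back.
lossless = text.encode("iso-8859-1")
print(lossless == raw)  # True
```

So the round trip does not convert the workbook into readable text. It yields a slightly different binary blob, which still prints as escape sequences and is no longer the file that was uploaded.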