2

I used df.to_csv() to write a DataFrame to a CSV file. The pandas docs state that under Python 3 it defaults to UTF-8 encoding.

However, when I run pd.read_csv() on the same file, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte

But using pd.read_csv() with encoding="ISO-8859-1" works.

What is the issue here and how do I resolve it so I can write and load files with consistent encoding?

2
  • Can you give an example dataframe that reproduces this problem? Commented May 11, 2016 at 3:34
  • @TadhgMcDonald-Jensen Unable to, but I consistently get the problem for the dataframes I created using other data sets from the same source.
    – hangc
    Commented May 11, 2016 at 4:58

3 Answers

3

Please try to read the data using encoding='unicode_escape'.

2

The original .csv you are trying to read is encoded in something other than UTF-8, e.g. ISO-8859-1. That's why you get a UnicodeDecodeError: Python / pandas tries to decode the source with the utf-8 codec, assuming by default that the source is UTF-8 encoded.

Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode into the format used internally.

See python docs and more here. Also very good.
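The mismatch described above can be reproduced without pandas. This is a minimal sketch, assuming only the single byte 0xae from the error message in the question; the variable names are illustrative.

```python
# The single byte reported in the question's error message.
raw = b"\xae"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xAE has the bit pattern of a UTF-8 continuation byte,
    # so it cannot start a character on its own.
    utf8_ok = False

# ISO-8859-1 maps every possible byte to a code point, so decoding never fails.
latin1_text = raw.decode("iso-8859-1")

print(utf8_ok)      # False
print(latin1_text)  # ®
```

Because ISO-8859-1 accepts any byte sequence, a successful read with that encoding does not prove the file was actually written in it.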

3
  • I did not try reading with ISO-8859-1 because I encountered no error while reading the original file as UTF-8. Does this mean the error is possible even though I was able to load the original file and write it to my disk?
    – hangc
    Commented May 11, 2016 at 6:43
  • You are right that this is how it's supposed to work. If your file is UTF-8 encoded, as Python 3 does by default, you should be able to read it in without problems. This typically breaks when the file is opened with another app in between, like Excel. I would need more detail about the file and the process to reproduce the error; so far I can only interpret the error message.
    – Stefan
    Commented May 11, 2016 at 7:10
  • Adding the parameter encoding='utf-8' to df.to_csv() solves the problem.
    – hangc
    Commented May 19, 2016 at 2:08
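The fix reported in the comment above can be sketched as a round trip that names the encoding on both the write and the read, so neither side depends on a default. The file name, the temporary directory, and the column name are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Sample data containing a non-ASCII character (right single quotation mark).
df = pd.DataFrame({"text": ["Yo what's up", "I like your sister\u2019s friend"]})

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "roundtrip.csv")

# State the encoding explicitly when writing and when reading.
df.to_csv(path, index=False, encoding="utf-8")
df_back = pd.read_csv(path, encoding="utf-8")

print(df_back["text"].tolist() == df["text"].tolist())  # True
```

Passing the same encoding to both calls keeps the file readable regardless of the platform's locale settings.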
0

Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not passed explicitly to DataFrame.to_csv.

0x92 is ’ in Windows-1252 (the right single quotation mark, which looks like an apostrophe).

import pandas

ERRORFILE = r'written_without_encoding_parameter.csv'
NO_ERRORFILE = r'written_WITH_encoding_parameter.csv'

df_dummy = pandas.DataFrame([u"Yo what's up", u"I like your sister’s friend"])

# Write the same frame twice: once with the default encoding, once with explicit UTF-8.
df_dummy.to_csv(ERRORFILE)
df_dummy.to_csv(NO_ERRORFILE, encoding="utf-8")

# Reading works with Latin-1, and works by default for the file written as UTF-8.
df_no_error_with_latin = pandas.read_csv(ERRORFILE, encoding="Latin-1")
df_no_error = pandas.read_csv(NO_ERRORFILE)

# Reading the file written without the encoding parameter fails:
df_error = pandas.read_csv(ERRORFILE)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

So it looks like you have to pass encoding="utf-8" explicitly to to_csv, even though the pandas docs say this is the default. Alternatively, use encoding="Latin-1" with read_csv.

Even more frustrating...

df_error_even_with_utf8 = pandas.read_csv(ERRORFILE, encoding="utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

I am using Windows 7, Python 3.5, pandas 0.19.2.
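The 0x92 byte in this error can be examined at the byte level. The sketch below assumes the file was written in Windows-1252 (cp1252), which the example above does not confirm. It is one plausible explanation on a Windows machine, not a verified diagnosis.

```python
import locale

text = "sister\u2019s"  # contains the right single quotation mark

# The same character becomes different bytes under different encodings.
cp1252_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")
print(cp1252_bytes)  # b'sister\x92s'
print(utf8_bytes)    # b'sister\xe2\x80\x99s'

# Bytes written as cp1252 are not valid UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# cp1252 recovers the original text. Latin-1 also succeeds, but it turns
# 0x92 into the control character U+0092 instead of the quotation mark.
as_cp1252 = cp1252_bytes.decode("cp1252")
as_latin1 = cp1252_bytes.decode("latin-1")
print(as_cp1252 == text)  # True
print(as_latin1 == text)  # False

# The encoding Python uses for text files when none is specified.
print(locale.getpreferredencoding(False))
```

If this assumption holds, reading with encoding="cp1252" would restore the apostrophe, whereas encoding="Latin-1" loads without an error but leaves a control character in the data.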

  • Had the same issue after converting an Excel df to a CSV. Later I tried to read the CSV and saw the 'utf-8' error: invalid start byte. Using pd.read_csv() with encoding="ISO-8859-1" works, but I'm not sure why. Maybe it's a bug in Python 3.6? Commented Aug 21, 2017 at 17:48
