Encoding error when reading csv file containing pandas dataframe

Question

I used df.to_csv() to write a dataframe to a CSV file. Under Python 3, the pandas docs state that it defaults to UTF-8 encoding.

However when I run pd.read_csv() on the same file, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte

But using pd.read_csv() with encoding="ISO-8859-1" works.

What is the issue here and how do I resolve it so I can write and load files with consistent encoding?

Can you give an example dataframe that reproduces this problem? — Tadhg McDonald-Jensen, Commented May 11, 2016 at 3:34
@TadhgMcDonald-Jensen Unable to, but I consistently get the problem for the dataframes I created using other data sets from the same source. — hangc, Commented May 11, 2016 at 4:58

Rahila T - Intel · Accepted Answer · 2019-02-01 07:25:24Z

Please try to read the data using encoding='unicode_escape'.
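A caveat worth checking before relying on this: Python's unicode_escape codec decodes the bytes as Latin-1 and also interprets backslash escape sequences, so data containing literal backslashes can come back altered. The sketch below uses plain bytes rather than a real CSV, and the sample value is made up for illustration.

```python
# Made-up sample: Latin-1 encoded text followed by a Windows-style path
# containing literal backslashes.
raw = b"caf\xe9,C:\\temp\\new"

# Plain Latin-1 decoding keeps the backslashes as they are.
as_latin1 = raw.decode("latin-1")

# unicode_escape also decodes 0xe9 as a Latin-1 character, but it turns
# the backslash-t and backslash-n sequences into a tab and a newline.
as_escape = raw.decode("unicode_escape")

print(repr(as_latin1))
print(repr(as_escape))
```

If the data may contain backslashes, naming the actual encoding (for example encoding="ISO-8859-1") is the safer choice.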


Stefan · Accepted Answer · 2016-05-11 06:38:13Z

The original .csv you are trying to read is encoded in something other than UTF-8, e.g. ISO-8859-1. That is why you get a UnicodeDecodeError: Python/pandas tries to decode the source with the utf-8 codec, assuming by default that the source is UTF-8 encoded.

Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode into the format used internally.
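The mismatch can be reproduced without pandas. The following is a minimal sketch with a made-up sample string; it only assumes that the file on disk holds ISO-8859-1 bytes, where 0xae is the registered-trademark sign.

```python
# Bytes as they might appear in a file saved as ISO-8859-1.
# "Widget" is a hypothetical value; \u00ae is the (R) sign.
raw = "Widget\u00ae".encode("latin-1")  # ends with the single byte 0xae

# Decoding with the wrong codec fails: 0xae cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)

# Decoding with the codec that matches the bytes succeeds.
text = raw.decode("latin-1")
print(text)
```

pd.read_csv(path, encoding="ISO-8859-1") performs the same second step on the whole file.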

See python docs and more here. Also very good.


I did not try reading with ISO-8859-1, as I encountered no error while reading the original file as UTF-8. Does this mean the error is possible even though I was able to load the original file and write it to my disk?
– hangc
Commented May 11, 2016 at 6:43
You are right that this is how it's supposed to work. If your file is UTF-8 encoded, as Python 3 does by default, you should be able to read it in without problems. This typically breaks when the file is opened and saved by another app in between, like Excel. I would need more detail about the file and the process to reproduce the error; sorry, so far I can only interpret the error message.
– Stefan
Commented May 11, 2016 at 7:10
Adding the parameter encoding='utf-8' to df.to_csv() solves the problem.
– hangc
Commented May 19, 2016 at 2:08


Kardo Paska · Accepted Answer · 2017-05-29 02:33:43Z

Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not passed explicitly to DataFrame.to_csv.

In Windows-1252 (cp1252), byte 0x92 is ’, the right single quotation mark (it looks like an apostrophe).

import pandas

# Output paths, relative to the working directory
ERRORFILE = r'written_without_encoding_parameter.csv'
NO_ERRORFILE = r'written_WITH_encoding_parameter.csv'

# The second string contains ’ (U+2019), which is not ASCII
df_dummy = pandas.DataFrame([u"Yo what's up", u"I like your sister’s friend"])

df_dummy.to_csv(ERRORFILE)                       # no encoding argument
df_dummy.to_csv(NO_ERRORFILE, encoding="utf-8")  # explicit UTF-8

# Both of these reads succeed
df_no_error_with_latin = pandas.read_csv(ERRORFILE, encoding="Latin-1")
df_no_error = pandas.read_csv(NO_ERRORFILE)

# This read fails in the environment listed at the end of this answer
df_error = pandas.read_csv(ERRORFILE)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

So it looks like you have to pass encoding="utf-8" explicitly to to_csv, even though the pandas docs say this is the default, or else use encoding="Latin-1" with read_csv.
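A minimal round-trip sketch of that approach, assuming a reasonably recent pandas. The file name roundtrip.csv and the column name text are made up for illustration. Passing the same encoding to both to_csv and read_csv removes any dependence on platform or version defaults.

```python
import os
import tempfile

import pandas as pd

# Same sample strings as above; \u2019 is the right single quotation mark.
df = pd.DataFrame({"text": ["Yo what's up", "I like your sister\u2019s friend"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")  # hypothetical file name

    # Write and read with the same explicit encoding.
    df.to_csv(path, index=False, encoding="utf-8")
    with open(path, "rb") as fh:
        raw = fh.read()  # raw bytes, kept only to inspect what was written
    df_back = pd.read_csv(path, encoding="utf-8")

# The quotation mark is stored as the three UTF-8 bytes e2 80 99.
print(b"\xe2\x80\x99" in raw)
print(df_back["text"].tolist() == df["text"].tolist())
```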

Even more frustrating...

df_error_even_with_utf8 = pandas.read_csv(ERRORFILE, encoding="utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
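One reading consistent with the 0x92 byte is that the first file was written in Windows-1252 (cp1252) rather than UTF-8. That is an assumption, not something the answer confirms. Under that assumption, the sketch below shows why an explicit encoding="utf-8" on the read side cannot help, and why Latin-1 reads without an error but does not return the original character.

```python
quote = "\u2019"  # right single quotation mark

as_cp1252 = quote.encode("cp1252")  # the single byte 0x92
as_utf8 = quote.encode("utf-8")     # the three bytes e2 80 99

# A lone 0x92 is not valid UTF-8, whether or not the codec is named explicitly.
try:
    as_cp1252.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a code point, so it never raises, but 0x92
# becomes the control character U+0092 instead of the quotation mark.
from_latin1 = as_cp1252.decode("latin-1")

# cp1252 recovers the original character.
from_cp1252 = as_cp1252.decode("cp1252")

print(as_cp1252, as_utf8, utf8_ok, repr(from_latin1), from_cp1252 == quote)
```

If this is what happened, pandas.read_csv(ERRORFILE, encoding="cp1252") would be the read-side setting that preserves the character.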

I am using Windows 7, Python 3.5, pandas 0.19.2.

Had the same issue, converted an excel df to a csv. Later tried to read the csv and saw the 'utf-8' error - invalid start byte. Using pd.read_csv() with encoding="ISO-8859-1" works but not sure why. Maybe its a bug in Python 3.6? — Arthur D. Howland, Commented Aug 21, 2017 at 17:48
