92

I'm attempting to read a CSV file into a DataFrame in pandas. When I try to do that, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 55: invalid start byte

It comes from this code:

import pandas as pd

location = r"C:\Users\khtad\Documents\test.csv"

df = pd.read_csv(location, header=0, quotechar='"')

This is on a Windows 7 Enterprise Service Pack 1 machine, and it seems to apply to every CSV file I create. In this particular case the byte at position 55 is 00101001 and the byte at position 54 is 01110011, if that matters.

Saving the file as UTF-8 with a text editor doesn't seem to help. Adding the parameter encoding='utf-8' doesn't work either; it returns the same error.

What is the most likely cause of this error and are there any workarounds other than abandoning the DataFrame construct for the moment and using the csv module to read in the CSV line-by-line?

9
  • 3
    Have you tried passing the parameter encoding='utf-8' to read_csv? Commented May 26, 2015 at 15:29
  • 2
    Or have you tried reading the file with the csv module to check whether there is an issue with the file itself? Commented May 26, 2015 at 16:15
  • 1
    @EdChum I'll add that to the question, but yes, that's one of the things I tried. Commented May 26, 2015 at 17:06
  • 3
    You'll have to post raw input or a link to the data. You could also try encoding='utf-16', just in case. Commented May 26, 2015 at 17:12
  • 2
    Please don't use pd.DataFrame.from_csv; it is no longer maintained. Use the top-level pd.read_csv instead, as it is more feature-rich. Commented May 27, 2015 at 10:16
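
Regarding the UTF-16 suggestion above: one quick way to check is to look for a byte order mark (BOM) at the start of the file. The following is only a rough sketch; the helper name sniff_bom is made up for illustration, and a file without a BOM can still be UTF-16, so a None result proves nothing.

```python
import codecs

def sniff_bom(raw: bytes):
    """Return a codec name suggested by a leading BOM, or None if there is none.

    Hypothetical helper for illustration; not part of pandas.
    """
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return None
```

To use it, read the first few bytes of the file in binary mode (open(location, "rb")) and, if a name comes back, pass it as the encoding argument to read_csv.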

2 Answers

224

Try calling read_csv with encoding='latin1', encoding='iso-8859-1', or encoding='cp1252'; these are some of the encodings commonly found on Windows. (In Python, 'latin1' and 'iso-8859-1' are aliases for the same codec.)
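
If you would rather not guess by hand, a small helper can try the candidates in order on the raw bytes. This is a minimal sketch, not a robust detector: the function name and the candidate order are my own assumptions, and 'latin1' is placed last because it accepts every byte value and so never fails.

```python
def first_working_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate encoding that decodes raw without error.

    Hypothetical helper for illustration. Decoding without an error does not
    prove the encoding is the right one, only that it is a possible one.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this codec rejects the data, try the next one
        return enc
    raise ValueError("none of the candidate encodings could decode the data")

# A made-up sample line containing byte 0x96, the byte from the error message.
sample = b"Sales 2014\x962015,100\n"
print(first_working_encoding(sample))
```

For a real file, read it with open(location, "rb").read(), then pass the returned name to pd.read_csv(location, encoding=...).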


6 Comments

I was able to use all 3 of these encodings successfully.
Carefully choose the encoding. There are a few differences such as typographic quotes. Another common one is iso-8859-15, which includes the EUR sign.
This was the first thread I stumbled upon about this problem, so just for the sake of completeness: None of the above worked for my (similar) problem, but UTF-16 as encoding did work. Try this if the ones mentioned by maxymoo fail.
Removing the encoding argument worked for me.
encoding='iso-8859-1' worked for me on Windows.
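
To illustrate the point above about choosing carefully: the same byte can decode without error under several of these encodings and still give different characters. A small sketch using only Python's built-in codecs (the variable names are just for illustration):

```python
# Byte 0x96 is the one reported in the question's error message.
raw = b"\x96"

as_cp1252 = raw.decode("cp1252")   # EN DASH, U+2013
as_latin1 = raw.decode("latin1")   # U+0096, an invisible C1 control character

# Byte 0xA4 shows the iso-8859-15 difference mentioned above.
euro_latin9 = b"\xa4".decode("iso-8859-15")  # EURO SIGN, U+20AC
sign_latin1 = b"\xa4".decode("latin1")       # CURRENCY SIGN, U+00A4

# Typographic double quotes occupy 0x93 and 0x94 in cp1252.
quotes_cp1252 = b"\x93hi\x94".decode("cp1252")

for label, text in [("cp1252", as_cp1252), ("latin1", as_latin1),
                    ("iso-8859-15", euro_latin9), ("latin1", sign_latin1)]:
    print(label, [hex(ord(ch)) for ch in text])
```

So if the file came from a Windows program and contains dashes or curly quotes, cp1252 is likely the better guess than latin1, even though both load without raising.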
22

This works on Mac as well. You can use:

import pandas as pd

df = pd.read_csv('Region_count.csv', encoding='latin1')
