I'm using Python 3.12.6 and pandas==2.2.3.
This is a simple snippet that I've always used, and it has always worked, to read the first sheet of an Excel file:
df = pd.read_excel(file_path, engine='openpyxl', sheet_name=0, index_col=None)
However, I have an Excel file that is behaving strangely. The sheet header has these columns:
"NOME", "DATA INSCRIÇÃO ", "PROVA OBJETIVA", "PROVA DISCURSIVA"
However, note that some of these header cells contain line breaks, which I suspect might not play well with UTF-8 encoding.
read_excel() only reads up to the column "DATA INSCRIÇÃO".
print(df.columns)
Index(['NOME', 'DATA INSCRIÇÃO '],
dtype='object')
When I save this sheet as .csv and open it with Notepad, this is what I see:
NOME;DATA INSCRIÇÃO ;"PROVA
OBJETIVA";"PROVA
DISCURSIVA"
I've noticed there are line breaks, as well as quotes, precisely in the problematic columns. Does anyone have any idea why it's breaking? Or is there a better way to read all the columns in Python?
If I save the sheet as .csv and read it with read_csv(), it breaks with an encoding error, which I suspect is the problem. BUT, if I try this:
df = pd.read_csv(csv_path, delimiter=';', encoding='latin1')
It works! If I'm interpreting this correctly, this tells me that there might be a latin1 encoded line break that read_excel can't read. The problem is: read_excel() has no encoding argument. I've looked at the other possible arguments to read_excel, but nothing seems to help. Any help would be greatly appreciated.
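As a sanity check on the encoding theory, here is a minimal standard-library sketch. It assumes the exported CSV header looks exactly like the Notepad output above and was written as latin1/cp1252 bytes (the string `header_text` is my reconstruction, not the real file). It suggests that the UTF-8 failure may come from the accented characters "Ç" and "Ã" rather than from the line breaks, and that quoted line breaks alone do not stop a CSV parser from seeing all four columns:

```python
import csv
import io

# Assumed reconstruction of the exported CSV header (not the real file):
# ';' delimiter, quoted cells containing line breaks, latin1/cp1252 bytes.
header_text = 'NOME;DATA INSCRIÇÃO ;"PROVA\nOBJETIVA";"PROVA\nDISCURSIVA"\r\n'
raw_bytes = header_text.encode("latin1")

# Try decoding as UTF-8: the latin1 bytes for "Ç" and "Ã" are not a valid
# UTF-8 sequence, so this fails regardless of the line breaks.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as latin1 succeeds, and the csv module keeps a quoted line break
# as part of the field instead of starting a new row.
decoded = raw_bytes.decode("latin1")
rows = list(csv.reader(io.StringIO(decoded, newline=""), delimiter=";"))
columns = rows[0]

print(utf8_ok)
print(columns)
```

If this holds for the real file too, the encoding error in read_csv() and the missing columns in read_excel() could be two separate issues, since an .xlsx file stores its text as XML rather than in a locale-dependent encoding.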