How to use Python 3 to import Excel files with CSV extensions saved as Unicode Text?

Question

As output from another service, I have received many Excel files that have a

".csv" filename extension but are saved as "Unicode Text (*.txt)".

To my knowledge they don't have an Unicode characters so I am not worried about data loss and if I was producing the data I would not have saved it this way. However, I need to process 100s of these files and I have been unable to import them using Python (specifically Python 3).

I have tried many different options such as the

csv module, pandas.read_csv(), and pyexcel.get_sheet()

to import the Excel file directly without any sucess. Often with errors such as

"... can't decode byte 0xff in position 0: invalid start byte".

I can manually save the file in Excel with a ".csv" extension and a "CSV (Comma delimited)(*.csv)" file type which can then be imported (e.g. pyexcel.get_sheet() ), but can't figure out how to do this programmatically.

I can also manually open the original file in Notepad and save it as a text file with a ".txt" extension and ANSI encoding, which allows me to import the data using numpy.loadtxt(). This isn't ideal because it is also manual. Additionally, I don't know why I need to convert to ANSI encoding and can't use UTF-8 for it to be read using

numpy.loadtxt(file_name, encoding="UTF-8"),

which results in an error such as

"... ValueError: could not convert string to float: '\ufeff ..."

and the following error using just numpy.loadtxt(file_name)

"... ValueError: could not convert string to float: 'ï»¿ ..."

In summary, my main goal is to find a programmatic way to change the Excel file type/encoding to something I can import into Python 3 using existing packages with CSV support. Additionally, if someone has any idea why numpy.loadtxt can't import a text file with a UTF-8 encoding (but can for ANSI encoding) that would be great too! Any knowledge to help me understand the problem (or my misconception of the problem) is appreciated.

loadtxt (and genfromtxt) defaults to parsing the data as floats - it tries to convert each string to a number. That isn't a unicode or bytestring issue. You need to specify the appropriate dtype. For a start I'd suggest dtype=None, which lets it deduce the right dtype for each column. You may also need to handle the header lines. — hpaulj
– hpaulj, Commented Apr 18, 2018 at 21:59
@hpaulj: I appreciate the suggestion however this did not solve the problem when trying to load the text files with UTF-8 encoding. The same Value Error occurs when using dtype=None. The data in the text file does not have any headers and all values are floats. — jpantina
– jpantina, Commented Apr 18, 2018 at 22:24
Without a sample file, it'll be hard to help you. If, for example, you posted a few lines of the file, I could copy-n-paste that and make my own file to test. — hpaulj
– hpaulj, Commented Apr 18, 2018 at 22:29
A recent utf8 csv example: stackoverflow.com/questions/49859957/… — hpaulj
– hpaulj, Commented Apr 18, 2018 at 22:37
@hpaulj: I just made a text file with "1\t2\n3\t4\n" and saved it as .txt with UTF-8 encoding and get the same error as my other data. After save-as to change the file to ANSI encoding it loads with no problems using np.loadtxt. This doesn't really solve my problem anyway since I was doing it manually. — jpantina
– jpantina, Commented Apr 18, 2018 at 22:51

Mark Ransom · Accepted Answer · 2018-04-19 01:34:12Z

3

Your file has a UTF-8 BOM at the front. Python can strip this automatically with the utf_8_sig codec:

numpy.loadtxt(file_name, encoding="utf_8_sig"),

answered Apr 19, 2018 at 1:34

Mark Ransom

310k44 gold badges423 silver badges660 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Mark Tolonen Over a year ago

OP got FF as the first byte, so utf16. UTF-8 BOM starts with EF.

Mark Tolonen Over a year ago

@roeland No, "can't decode byte 0xff in position 0: invalid start byte". The first byte read from the file was FF. That's a UTF-16 BOM. See my answer which reads a Unicode Text-saved Excel file.

Mark Ransom Over a year ago

@MarkTolonen I'm looking at the two error messages at the end, one of which clearly is the BOM as a single Unicode character and the other which is the 3 byte UTF-8 equivalent. There's nothing in the question that indicates these files originated in Excel. It will be interesting to see which of us is right.

Mark Tolonen Over a year ago

The first two lines of the question: I have received many Excel files that have a ".csv" filename extension but are saved as "Unicode Text (*.txt)". That last part is a save option in Excel, which I used in my answer to generate the file. Also, the last half of the question states the OP attempted "manually save the file in Excel with a ".csv" extension and a "CSV (Comma delimited)(*.csv)" which resulted in the later errors.

Mark Ransom Over a year ago

@MarkTolonen How did I miss that? Must be time to give up for the night.

|

Mark Tolonen · Accepted Answer · 2018-04-19 03:09:32Z

2

Excel's "Unicode Text" is UTF-16-encoded and tab-delimited. With the csv module, use:

import csv
with open('book1.txt',encoding='utf16') as f:
    r = csv.reader(f,delimiter='\t')
    for row in r:
        print(row)

With the following Excel sheet saved as "Unicode Text" (Excel 2016):

Excel table of English/Chinese

This produces (Python 3.6):

['English', 'Chinese']
['Mark', '马克']
['American', '美国人']

Pandas also works (but not with utf16...it oddly needed the hyphen):

>>> import pandas as pd
>>> pd.read_csv('book1.txt',delimiter='\t',encoding='utf-16')
    English Chinese
0      Mark      马克
1  American     美国人

edited Apr 19, 2018 at 3:09

answered Apr 19, 2018 at 1:58

Mark Tolonen

181k26 gold badges182 silver badges278 bronze badges

1 Comment

jpantina Over a year ago

@Mark_Tolonen: Thanks for the answer and explanation, it solved my problem.

hpaulj · Accepted Answer · 2018-04-19 00:25:31Z

0

Default dtype is float. genfromtxt uses nan when it can't parse the string:

In [8]: np.genfromtxt(['one, 2, 3'], delimiter=',',encoding=None)
Out[8]: array([nan,  2.,  3.])

loadtxt raises an error:

In [9]: np.loadtxt(['one, 2, 3'], delimiter=',',encoding=None)   
....
ValueError: could not convert string to float: 'one'
In [10]: np.loadtxt(['1, 2, 3'], delimiter=',',encoding=None)
Out[10]: array([1., 2., 3.])

genfromtxt with dtype=None:

In [12]: np.genfromtxt(['one, 2, 3'], delimiter=',',encoding=None,dtype=None)
Out[12]: array(('one', 2, 3), dtype=[('f0', '<U3'), ('f1', '<i8'), ('f2', '<i8')])

from your comment:

In [13]: np.genfromtxt('1\t2\n3\t4\n'.splitlines(), delimiter='\t',encoding=None
    ...: )
Out[13]: 
array([[1., 2.],
       [3., 4.]])

In [21]: print('1\t2\n3\t4\n')
1   2
3   4

Using encoding='utf8' works as well.

edited Apr 19, 2018 at 0:25

answered Apr 18, 2018 at 23:14

hpaulj

233k14 gold badges260 silver badges392 bronze badges

2 Comments

jpantina Over a year ago

While useful knowledge, I respectfully can't accept this as an answer. This doesn't address my original question regarding file types. Additionally, your genfromtxt example (the "\t" and "\n" were just shorthand to represent the data for the file) isn't the same because it it should be saved to a text file with UTF-8 encoding to replicate my issue.

hpaulj Over a year ago

I demonstrated this in a Py3 Ipython session, so the strings are unicode. In earlier numpy versions I would have use bytestrings, b'1,2'. A utf8 file with the same values works the same.

Collectives™ on Stack Overflow

How to use Python 3 to import Excel files with CSV extensions saved as Unicode Text?

3 Answers 3

7 Comments

1 Comment

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

7 Comments

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related