As output from another service, I have received many Excel files that have a
".csv" filename extension but are saved as "Unicode Text (*.txt)".
To my knowledge they don't have an Unicode characters so I am not worried about data loss and if I was producing the data I would not have saved it this way. However, I need to process 100s of these files and I have been unable to import them using Python (specifically Python 3).
I have tried many different options such as the
csv module, pandas.read_csv(), and pyexcel.get_sheet()
to import the Excel file directly without any sucess. Often with errors such as
"... can't decode byte 0xff in position 0: invalid start byte".
I can manually save the file in Excel with a ".csv" extension and a "CSV (Comma delimited)(*.csv)" file type which can then be imported (e.g. pyexcel.get_sheet() ), but can't figure out how to do this programmatically.
I can also manually open the original file in Notepad and save it as a text file with a ".txt" extension and ANSI encoding, which allows me to import the data using numpy.loadtxt(). This isn't ideal because it is also manual. Additionally, I don't know why I need to convert to ANSI encoding and can't use UTF-8 for it to be read using
numpy.loadtxt(file_name, encoding="UTF-8"),
which results in an error such as
"... ValueError: could not convert string to float: '\ufeff ..."
and the following error using just numpy.loadtxt(file_name)
"... ValueError: could not convert string to float: ' ..."
In summary, my main goal is to find a programmatic way to change the Excel file type/encoding to something I can import into Python 3 using existing packages with CSV support. Additionally, if someone has any idea why numpy.loadtxt can't import a text file with a UTF-8 encoding (but can for ANSI encoding) that would be great too! Any knowledge to help me understand the problem (or my misconception of the problem) is appreciated.

loadtxt(andgenfromtxt) defaults to parsing the data as floats - it tries to convert each string to a number. That isn't a unicode or bytestring issue. You need to specify the appropriatedtype. For a start I'd suggestdtype=None, which lets it deduce the right dtype for each column. You may also need to handle the header lines.