Iterating through a folder and converting all text files to csv files Error

Question

I ran a PowerShell script that output a bunch of text files.

The text files look like this:

This is my aText.txt

    Clark Kent
    Dolly Parten
    Charlie Brown
    Gary Numan

They are just text files with names and no header. I want to convert them to csv files, so I turned to Python and wrote this code:

    import os
    import pandas as pd
    
    # raw strings, no trailing backslash: '\path\text\' is a syntax error
    folder = r'\path\text'
    csvFolder = r'\path\csv'
    
    for filename in os.listdir(folder):
    
        if filename.endswith('.txt'):
            file_path = os.path.join(folder, filename)
            csvpath = os.path.join(csvFolder, filename)
            
            #if file is empty
            if os.stat(file_path).st_size == 0:
                df = pd.DataFrame()
    
            #for other files
            else:
                df = pd.read_csv(file_path, header=0, names=None)
    
            csv_path = os.path.splitext(csvpath)[0] + '.csv'
    
            df.to_csv(csv_path, index=False)
    
    
    print("Text files have been converted to csv")

When I ran it, it gave me an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I did some research but didn't find anything for Pandas, only for the csv module. One of the responses suggested this:

    df = pd.read_csv(file_path, encoding='cp1252', header=0, names=None)

I tried it and the program ran, but the csv files were filled with strange characters. On a test folder of text files I created by hand, the code ran fine and the output was good. With the text files created by PowerShell, the code runs without error messages but the output is wrong.

Here is an example of what I am seeing in the csv files after the conversion:

    ¿ Ã Ÿâ

The else branch seems to be where the problem occurs, since that is where the file is read. I printed df:

    df = pd.read_csv(file_path, encoding='cp1252', header=0, names=None)
    print("This is df: ", df)

This is the sample output:

    This is df:      ÿþA
    0   NaN
    1   NaN
    2   NaN
    3   NaN
    4   NaN
    5   NaN
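
The `ÿþ` at the start of the column header is a useful clue. The bytes 0xFF 0xFE are the UTF-16 little-endian byte order mark (BOM), and cp1252 renders them as `ÿþ`. 0xFF is also the byte the UnicodeDecodeError complained about at position 0. The sketch below reproduces this with an in-memory sample instead of a real file. The sample text and the helper name `looks_like_utf16` are made up for illustration.

```python
import codecs


def looks_like_utf16(data: bytes) -> bool:
    """Return True if the bytes start with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# Hypothetical sample: a UTF-16 LE text file as it would sit on disk.
raw = codecs.BOM_UTF16_LE + "Clark Kent\r\nDolly Parten\r\n".encode("utf-16-le")

# The first two bytes are the BOM.
print(looks_like_utf16(raw))

# Decoding the BOM bytes as cp1252 gives the stray header characters.
bom_as_cp1252 = raw[:2].decode("cp1252")
print(repr(bom_as_cp1252))

# Decoding with the "utf-16" codec consumes the BOM and recovers the names.
names = raw.decode("utf-16").splitlines()
print(names)
```

For a real file, the same check could be run on the first two bytes read in binary mode, for example `open(file_path, 'rb').read(2)`.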

You're making us guess where the error is. Please update the question to include the full error traceback message.
– John Gordon, Commented Oct 16, 2023 at 22:51
John, I updated my question. But I realized now how to fix the issue. I had the encoding completely wrong, since they're not encoded in the default UTF-8 that Pandas auto-assumes.
– noobCoder, Commented Oct 16, 2023 at 23:19
@noobCoder, can I answer my own question? You betcha! stackoverflow.com/help/self-answer
– J_H, Commented Oct 16, 2023 at 23:24
"use the right encoding when opening the file" isn't a very interesting solution. If that's all it was, you're probably better off just deleting the question.
– John Gordon, Commented Oct 16, 2023 at 23:27

noobCoder · Accepted Answer · 2023-10-16 23:34:32Z


I think I blew this issue out of proportion. I thought it was a much larger problem, but playing around with the encoding while I waited for a response fixed it. I simply set the encoding to utf-16:

    df = pd.read_csv(file_path, encoding='utf-16', header=0)
    print("this is df: \n", df)

The output:

    this is df:
    Clark Kent
    Dolly Parten
    Charlie Brown
    Gary Numan
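
Putting the fix back into the folder loop might look like the sketch below. It assumes every .txt file in the folder is UTF-16 with a BOM, which is what Windows PowerShell 5.1 typically writes with `>` or `Out-File`. It also swaps `header=0` for `header=None`, because the files have no header row and `header=0` would turn the first name into the column header. The function name `convert_folder` and the column name `name` are illustrative and not part of the original code.

```python
from pathlib import Path

import pandas as pd


def convert_folder(txt_dir, csv_dir, encoding="utf-16"):
    """Convert each one-name-per-line .txt file in txt_dir to a .csv in csv_dir.

    Returns the list of CSV paths that were written.
    """
    txt_dir, csv_dir = Path(txt_dir), Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)
    written = []

    for txt_path in sorted(txt_dir.glob("*.txt")):
        if txt_path.stat().st_size == 0:
            # Zero-byte file: write an empty CSV, as the original code did.
            df = pd.DataFrame()
        else:
            try:
                # header=None keeps the first line as data; "name" is an
                # assumed column label, since the files have no header.
                df = pd.read_csv(
                    txt_path, encoding=encoding, header=None, names=["name"]
                )
            except pd.errors.EmptyDataError:
                # Guard for files with a BOM but no text, in case pandas
                # reports them as empty.
                df = pd.DataFrame()

        csv_path = csv_dir / (txt_path.stem + ".csv")
        df.to_csv(csv_path, index=False)
        written.append(csv_path)

    return written
```

A call such as `convert_folder(folder, csvFolder)` would then replace the loop in the question. If some files in the folder are not UTF-16, a single fixed `encoding` will not work and the encoding would need to be detected per file.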

edited Oct 16, 2023 at 23:34

answered Oct 16, 2023 at 23:26

