I ran a PowerShell script that output a bunch of text files.

The text files look like this:

This is my aText.txt

    Clark Kent
    Dolly Parten
    Charlie Brown
    Gary Numan

They're just text files containing names, with no header. I want to convert them to CSV files, so I turned to Python and wrote this code:

    import os
    import pandas as pd
    
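    # Raw strings without a trailing backslash: '\path\text\' is a syntax
    # error because the final \' escapes the closing quote.
    folder = r'\path\text'
    csvFolder = r'\path\csv'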
    
    for filename in os.listdir(folder):
    
        if filename.endswith('.txt'):
            file_path = os.path.join(folder, filename)
            csvpath = os.path.join(csvFolder, filename)
            
            #if file is empty
            if os.stat(file_path).st_size == 0:
                df = pd.DataFrame()
    
            #for other files
            else:
                df = pd.read_csv(file_path, header=0, names=None)
    
            csv_path = os.path.splitext(csvpath)[0] + '.csv'
    
            df.to_csv(csv_path, index=False)
    
    
    print("Text files have been converted to csv")

When I ran it, it gave me an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I did some research but didn't find anything specific to pandas, only for the csv module. Someone suggested this in one of the responses:

    df = pd.read_csv(file_path, encoding='cp1252', header=0, names=None)

I tried it out and the program ran, but the CSV files were corrupted with strange characters. On a test folder of text files I created by hand, it ran fine and the output was good. With the text files created by PowerShell, the code runs with no error messages, but the output isn't correct.

Here is an example of what I am seeing in the csv files after the conversion:

    ¿ Ã Ÿâ

The else branch seems to be where the problem occurs, since that is where the file is read. I printed df:

    df = pd.read_csv(file_path, encoding='cp1252', header=0, names=None)
    print("This is df: ", df)

This is the sample output:

    This is df:      ÿþA
    0   NaN
    1   NaN
    2   NaN
    3   NaN
    4   NaN
    5   NaN
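
The `ÿþ` at the start of that output is a clue worth checking. As a minimal sketch (the sample bytes below are built by hand for illustration, not read from the actual PowerShell files), this shows how text saved as UTF-16 with a byte order mark behaves under the three codecs involved:

```python
import codecs

# Hand-built stand-in for the first line of a UTF-16 little-endian file
# that starts with a byte order mark (BOM). Assumption: the real files
# look like this; check with open(path, "rb").read(2).
raw = codecs.BOM_UTF16_LE + "Clark Kent\n".encode("utf-16-le")

first_two = raw[:2]              # the BOM bytes, 0xFF 0xFE

# Decoding as UTF-8 fails on the very first byte, as in the traceback.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# cp1252 maps every one of these bytes to some character, so it never
# fails, but the BOM becomes "ÿþ" and each letter is followed by "\x00".
wrong = raw.decode("cp1252")

# The "utf-16" codec reads the BOM, picks the byte order, and drops it.
right = raw.decode("utf-16")

print(repr(wrong[:6]))
print(repr(right))
```

If the first two bytes of a real file are `b'\xff\xfe'`, the file is very likely UTF-16 rather than cp1252 or UTF-8.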
  • You're making us guess where the error is. Please update the question to include the full error traceback message. Commented Oct 16, 2023 at 22:51
  • John, I updated my question. But I realized now how to fix the issue. I had the encoding completely wrong, since they're not encoded in the default UTF-8 that Pandas auto-assumes. Commented Oct 16, 2023 at 23:19
  • @noobCoder, can I answer my own question? You betcha! stackoverflow.com/help/self-answer Commented Oct 16, 2023 at 23:24
  • "use the right encoding when opening the file" isn't a very interesting solution. If that's all it was, you're probably better off just deleting the question. Commented Oct 16, 2023 at 23:27

1 Answer

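I think I blew this issue out of proportion. I thought it was a much larger problem, but playing around with the encoding while I waited for a response fixed it. The files are not UTF-8, which pandas assumes by default: the 0xff byte in the error and the ÿþ in the garbled output are the UTF-16 byte order mark (bytes 0xFF 0xFE) being read with the wrong codec. Passing encoding='utf-16' reads the files correctly: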

    df = pd.read_csv(file_path, encoding='utf-16', header=0)
    print("this is df: \n", df)

The output:

    this is df:
    Clark Kent
    Dolly Parten
    Charlie Brown
    Gary Numan
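
For completeness, here is a sketch of how the whole conversion loop might look with that encoding applied. The function name `convert_folder`, the `name` column label, and the choice of `header=None` (the files have no header row) are my own assumptions, not part of the original script. It also catches `EmptyDataError`, because a UTF-16 file that contains only a byte order mark is not zero bytes long, so a size check alone may not catch it.

```python
from pathlib import Path

import pandas as pd


def convert_folder(text_dir, csv_dir, encoding="utf-16"):
    """Convert each .txt file in text_dir to a .csv file in csv_dir.

    Hypothetical helper: assumes one name per line, no header row,
    and no commas inside the names.
    """
    text_dir, csv_dir = Path(text_dir), Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for txt_path in sorted(text_dir.glob("*.txt")):
        try:
            # header=None keeps the first name as data, not a column label.
            df = pd.read_csv(txt_path, encoding=encoding,
                             header=None, names=["name"])
        except pd.errors.EmptyDataError:
            # Nothing to parse: produce an empty CSV for this file.
            df = pd.DataFrame(columns=["name"])
        out_path = csv_dir / (txt_path.stem + ".csv")
        # Write UTF-8 with no index and no header, mirroring the input.
        df.to_csv(out_path, index=False, header=False, encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `convert_folder(r'\path\text', r'\path\csv')` would then write one UTF-8 CSV per text file. If some of the input files are not UTF-16, pass a different `encoding` for those.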