Changing Headers in .csv files

Question

Right now I am trying to read in data that is provided in a format that is awkward to parse. Here is an example:

#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]

When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.

However, right now I have 30+ files, split between two different folders and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].

The .csv files are in a folder one level below the script that is supposed to read them (i.e. folder 1 contains the code below and also contains folder 2, which holds the files).

Here is what I have right now:

#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in 
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located

def reader (sets, df, dataLabels, skip, newHeader, pathName):
     for i in range(len(sets)):
        
        df_temp = pd.read_csv(glob.glob(pathName+ sets[i]+".csv"), sep=r'\s*,', skiprows = skip, engine = 'python')[:-1] 
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
           df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]],errors = 'coerce')       
        df.append(df_temp)       
     return df

When I run my code, I run into the error:

No columns to parse from file

I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.

Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.

read_csv has a skiprows parameter that can be used to skip the first rows, and a comment parameter that can be set to # to ignore lines that start with it. If nothing works, you could use df.iterrows() to find the index of the first "good" row, then use good_df = df.iloc[(i + 1):].reset_index(drop=True) to get everything after that. You'll have to do some column-renaming and type-fixing in that case.
– Panagiotis Kanavos, Commented Jul 27, 2022 at 17:31

Kraigolas · Accepted Answer · 2022-07-27 17:23:03Z

5

If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:

import pandas as pd

df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])

which gives a DataFrame with columns x and y that contains only the data rows (1, 2), (3, 4) and (5, 6).
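As a small self-contained check of this approach, the sketch below parses the sample from the question out of an in-memory string. The sample in the question shows the marker line as [Data] without a leading #, in which case comment='#' alone would let that line through as a row of text, so the sketch also coerces every column to numbers and drops the rows that fail. The helper name read_xy is made up for illustration and is not part of pandas.

```python
import io

import pandas as pd

# The sample from the question, with the marker line exactly as shown there
SAMPLE = '''#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
'''


def read_xy(source, names=("x", "y")):
    """Read a delimited file, ignoring '#' lines and any non-numeric rows."""
    df = pd.read_csv(source, comment="#", names=list(names))
    # Turn anything that is not a number (e.g. a bare "[Data]" marker) into NaN
    df = df.apply(pd.to_numeric, errors="coerce")
    # Drop every row that has a NaN in any column
    return df.dropna().reset_index(drop=True)


df = read_xy(io.StringIO(SAMPLE))
print(df)
```

Because the marker row briefly introduces a NaN, the resulting columns are floating point; if the marker line does start with #, the coercion step changes nothing and can be left out.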


Panagiotis Kanavos · Accepted Answer · 2022-07-27 17:44:39Z

If you're lucky, you can use Kraigolas' answer to treat those lines as comments.

In other cases you may be able to use the skiprows and skipfooter arguments to skip a fixed number of header and footer rows:

df = pd.read_csv(path, skiprows=10, skipfooter=2, names=["x", "y"], engine="python")

And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
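A minimal sketch of the fixed-length case, using a made-up in-memory file with a 3-line heading and a 1-line footer rather than the 10 and 2 mentioned above. skipfooter is only supported by the Python parser engine, so engine="python" is passed explicitly here.

```python
import io

import pandas as pd

# Hypothetical file: three heading lines, three data rows, one totals line
RAW = "Report,\nGenerated,today\n[Data]\n1,2\n3,4\n5,6\nTotal,12\n"

df = pd.read_csv(
    io.StringIO(RAW),
    skiprows=3,        # heading lines to drop from the top
    skipfooter=1,      # totals lines to drop from the bottom
    names=["x", "y"],  # column names to use instead of a header row
    engine="python",   # skipfooter is not supported by the C engine
)
print(df)
```

This only works when every file has the same number of heading and footer lines, which is why the variable-length case needs a different approach.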

Unfortunately, I also have even worse files where the number of heading rows changes from file to file.

In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names in this case are taken from the first "good" row and the types from the first data row.

This is certainly not fast; it's a last-resort solution. If I had a better solution I'd use it:

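import pandas as pd

def promote_header_row(df, first_col, dtype):
    # df: frame read without a header; dtype: dict of column name -> type
    # Assumes df still has its default 0..n-1 index, so labels equal positions
    data = df
    if first_col not in df.columns:
        # Skip rows until we find the first col header
        for i, row in df.iterrows():
            if row.iloc[0] == first_col:
                data = df.iloc[(i + 1):].reset_index(drop=True)
                # Read the column names from the header row
                series = df.iloc[i]
                series = series.str.strip()
                data.columns = list(series)

                # Use only existing column types
                types = {k: v for k, v in dtype.items() if k in data.columns}
                # Apply the column types again
                data = data.astype(dtype=types)
                break
    return data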

In this case the condition is finding the first column name (first_col) in the first cell.

This can be adapted to use different conditions, e.g. looking for the first numeric cell:

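columns = ["x", "y"]
dtypes = {"x": "float64", "y": "float64"}

def rows_from_first_numeric(df, columns, dtypes):
    # Assumes df still has its default 0..n-1 index, so labels equal positions
    data = df
    # Skip until we find the first numeric value
    for i, row in df.iterrows():
        if str(row.iloc[0]).strip().isnumeric():
            # The numeric row is itself data, so keep it: slice from i, not i + 1
            data = df.iloc[i:].reset_index(drop=True)
            # Apply names and types
            data.columns = columns
            data = data.astype(dtype=dtypes)
            break
    return data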
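An alternative to scanning a DataFrame row by row is to find the marker line in the raw text first and hand only the remainder to pandas. This is a different technique from the code above, sketched under the assumption that the data always follows a line containing a known marker such as [Data]. The function name read_after_marker and its parameters are made up for illustration.

```python
import io

import pandas as pd


def read_after_marker(text, marker="[data]", names=("x", "y")):
    """Parse the rows that follow the first line containing `marker`."""
    lines = text.splitlines()
    # Index of the first line that contains the marker (case-insensitive)
    start = next(
        (i for i, line in enumerate(lines) if marker in line.lower()),
        None,
    )
    if start is None:
        raise ValueError(f"marker {marker!r} not found")
    body = "\n".join(lines[start + 1:])
    # Remaining '#' lines (e.g. the end-of-file marker) are treated as comments
    return pd.read_csv(io.StringIO(body), comment="#", names=list(names))


sample = "#Name,\n#Source,\n#[DATA]\n1,2\n3,4\n5,6\n#[END_OF_FILE]\n"
df = read_after_marker(sample)
print(df)
```

Since the heading is located by content rather than by count, the same call works whether a file has four heading lines or forty, as long as the marker is present.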

SamR · Accepted Answer · 2022-07-27 17:29:35Z

1

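Expanding on Kraigolas's answer, to do this with multiple files you can use a list comprehension. Note that glob.glob returns a list of paths, so the matches are flattened into one list before reading:

import glob
import pandas as pd
files = [path for set_num in sets for path in glob.glob(f"{pathName}{set_num}.csv")]
df = pd.concat([pd.read_csv(file, comment="#", names=["x", "y"]) for file in files])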
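Since the question asks for a reusable function with configurable column names, the same idea can be wrapped as below. This is only a sketch: read_folder, its parameters, and the added source column are illustrative choices, and it assumes every file follows the layout from the question with the marker line prefixed by #.

```python
from pathlib import Path

import pandas as pd


def read_folder(folder, names, pattern="*.csv"):
    """Concatenate every matching file in `folder` into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        frame = pd.read_csv(path, comment="#", names=list(names))
        # Remember which file each row came from (illustrative extra column)
        frame["source"] = path.stem
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    return pd.concat(frames, ignore_index=True)
```

For example, read_folder("folder2", ["x", "y"]) would read every .csv file in a subfolder named folder2, where folder2 is a placeholder for the real folder name. Raising when nothing matches makes a wrong path show up immediately instead of as a confusing parser error later.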
