Rename a column header in csv using python pandas

Question

I have some giant CSV files (around 23 GB each) and I want to do the following with their column headers:

If there is a column named SFID:
    Rename column "Id" to "IgnoreId"
    Rename column "SFID" to "Id"
Otherwise, do nothing.

All the Google search results I see are about how to import the CSV into a DataFrame, rename the column, and export it back to a CSV.

To me that feels like a giant waste of time and memory, because we are effectively only working with the very first row of the CSV file (the headers). I don't know whether it is necessary to load the whole CSV as a DataFrame and export it to a new CSV (or export it to the same CSV, effectively overwriting it).

Because the CSVs are huge, I have to load them in small chunks (chunksize) to perform the operation, which takes time and memory. Again, it feels like a waste, because apart from the headers we are not doing anything with the remaining chunks.

Is there a way to load just the header of a CSV file, change it, and save it back into the same CSV file?

I am open to using something other than pandas as well. The only real constraint is that the CSV files are too big to just double-click and open.

A file can't shift its data in place when you write a longer or shorter header. You would have to create a new file, write the new header, and then copy all rows from the old file to the new one. A CSV is an ordinary text file, so you can use the standard open(), read(), and write() to do this (in chunks, or a few rows at a time). It may be better to do this in bytes mode, open(..., 'rb') and open(..., 'wb'), because text mode may convert some characters, which can cause problems.
– furas, Commented Sep 17, 2019 at 5:48
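
The approach described in this comment (write the new header to a new file, then copy the remaining rows in bytes mode) could be sketched roughly as follows. The function name replace_header, the ".tmp" suffix, and the chunk size are assumptions made for illustration, not part of the original comment. The sketch needs enough free disk space for a second copy of the file.

```python
import os
import shutil


def replace_header(path, new_header, chunk_size=1024 * 1024):
    """Replace the first line of `path` with `new_header` (bytes, no line ending)."""
    tmp_path = path + ".tmp"  # hypothetical temp name; assumed not to exist yet
    try:
        with open(path, "rb") as src, open(tmp_path, "wb") as dst:
            old_header = src.readline()  # consume the old header line
            # Reuse the file's own line ending so it matches the data rows.
            ending = b"\r\n" if old_header.endswith(b"\r\n") else b"\n"
            dst.write(new_header + ending)
            shutil.copyfileobj(src, dst, chunk_size)  # stream the rows unchanged
        os.replace(tmp_path, path)  # swap the new file in for the original
    except BaseException:
        # Clean up the partial copy; the original file is still untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call such as replace_header("myfile.csv", b"IgnoreId,Id,Name") would then rewrite the file with the new header. Because the original is only replaced after the copy has finished, an interrupted run leaves the original file intact.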

Blessy · Accepted Answer · 2019-09-17 07:47:09Z

Write the header row first and copy the data rows using shutil.copyfileobj

shutil.copyfileobj took 38 seconds for a 0.5 GB file whereas fileinput took 125 seconds for the same.

Using shutil.copyfileobj

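import shutil

import pandas as pd

filename = "myfile.csv"  # placeholder path; replace with your own file

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns) + "\n"
    # modify header in csv file, reading and writing the same file in place
    # caution: the new header is not the same length as the old one, so the
    # write position drifts from the read position; keep a backup, because a
    # longer header can overwrite rows that have not been read yet
    with open(filename, "r+") as f1, open(filename, "r+") as f2:
        f1.readline()  # move the read pointer past the old header row
        f2.write(header_row)
        shutil.copyfileobj(f1, f2)  # copies the data rows
        f2.truncate()  # drop leftover bytes if the file became shorter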

Using fileinput

import fileinput

import pandas as pd

df = pd.read_csv(filename, nrows=0)  # read only the header row
if 'SFID' in df.columns:
    # rename columns
    df.rename(columns={"Id": "IgnoreId", "SFID": "Id"}, inplace=True)
    # construct new header row
    header_row = ','.join(df.columns)
    # modify header in csv file; with inplace=True, print() writes to the file
    with fileinput.input(filename, inplace=True) as f:
        for line in f:
            if fileinput.isfirstline():
                print(header_row)
            else:
                print(line, end='')

Stef · Answer · 2019-09-18 19:28:40Z

For a huge file, a simple command-line solution using the stream editor sed might be faster than a Python script:

sed -e '1 {/SFID/ {s/Id/IgnoreId/; s/SFID/Id/}}' -i myfile.csv

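This changes Id to IgnoreId and SFID to Id in the first line if that line contains SFID. If other column headers also contain the string Id (e.g. ImportantId), you'll have to refine the regexes in the s commands accordingly. The -i option edits the file in place; this form is GNU sed syntax (BSD/macOS sed expects a backup suffix argument after -i).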
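
One way to sidestep the substring problem is to rename by exact column name rather than by regex. The sketch below is a Python alternative, not part of the sed answer. It parses the header line with the standard csv module and applies the rename only to fields that match exactly. The function name rename_header_line is hypothetical, and the sketch assumes a comma-delimited header; the original quoting of individual header fields is not necessarily preserved.

```python
import csv
import io


def rename_header_line(header_line):
    """Return the header with Id -> IgnoreId and SFID -> Id, matched by exact name.

    The line is returned without a trailing newline. If there is no SFID
    column, the header is returned unchanged (apart from the stripped newline).
    """
    stripped = header_line.rstrip("\r\n")
    names = next(csv.reader([stripped]))  # split the header into column names
    if "SFID" not in names:
        return stripped
    mapping = {"Id": "IgnoreId", "SFID": "Id"}
    renamed = [mapping.get(name, name) for name in names]
    out = io.StringIO()
    csv.writer(out).writerow(renamed)  # re-join, quoting only where needed
    return out.getvalue().rstrip("\r\n")
```

With this, a column such as ImportantId is left alone because only whole field names are compared: rename_header_line("ImportantId,Id,SFID") gives "ImportantId,IgnoreId,Id".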
