Python | Combine multiple csv (100+) files from one folder taking csv header into consideration

Question

Requirement: I have a folder with multiple csv files. I need to perform following:

Scan the input folder for all csv files (file1.csv, file2.csv ..... filen.csv etc) & perform below steps
Open the first csv file (file1.csv) & store the file header in a list & then copy the entire content (including header) of file1.csv & put in combined.csv => Then move file1.csv to a /done folder under the parent folder
Move to next file file2.csv => Compare the header of file2.csv with the header of the first file to ensure they match exactly. If the headers match then copy contents (excluding header) of file2.csv & put in combined.csv => Then move file2.csv to a /done folder under the parent folder. If the header of file2.csv do not match then exclude combining this file, leave it in the same parent folder & move to next file for combining

I have placed sample files on the link gdrive folder with sample CSV files I am open to any solution either with CSV or pandas as long as it does what I want

As a starting point, I was initially working on comparing the header as below which works. However, I am unsure on how to move further

Code:

    import csv        
    def compare_two_csv_headers(csv_file1, csv_file2):
        with open(csv_file1, newline='') as f:
            reader = csv.reader(f)
            frow1 = next(reader)  # gets the first line
            print(frow1)
        with open(csv_file2, newline='') as f:
            reader = csv.reader(f)
            frow2 = next(reader)  # gets the first line
            print(frow2)
        if frow1==frow2:
            print('Same header')
        else:
            print('Different header')

    csv_file1 = 'D:/2009/cm01JAN2009bhav.csv'
    csv_file2 = 'D:/2009/cm01DEC2009bhav.csv'
    compare_two_csv_headers(csv_file1, csv_file2)

Here are the top 10 lines from the first csv file

SYMBOL,SERIES,OPEN,HIGH,LOW,CLOSE,LAST,PREVCLOSE,TOTTRDQTY,TOTTRDVAL,TIMESTAMP,
20MICRONS,EQ,46.5,47,45.7,46.05,46,46.55,7092,328975.25,31-DEC-2009,
3IINFOTECH,EQ,85.8,86.7,84.5,85.15,85.35,85.05,2423812,207760480.3,31-DEC-2009,
3MINDIA,EQ,1855.05,1879.9,1855.05,1865.75,1874.95,1850.45,85,158679.1,31-DEC-2009,
AARTIDRUGS,EQ,107.4,108.75,103.65,104.45,104.9,106.05,84012,8929759.4,31-DEC-2009,
AARTIIND,EQ,51,51.9,48.9,49.2,49.1,50.45,149365,7517110.3,31-DEC-2009,
AARVEEDEN,EQ,64,64.5,63.05,63.85,63.1,62.7,2172,138651.5,31-DEC-2009,
ABAN,EQ,1265,1297,1265,1283.65,1283.2,1260.05,1381290,1773221519.75,31-DEC-2009,
ABB,EQ,756.2,770.85,756.2,767.1,769.55,756.3,292376,223660807.4,31-DEC-2009,
ABCIL,EQ,85.4,89,84.9,86.85,86.95,84.7,59183,5170993.2,31-DEC-2009,

Parfait · Accepted Answer · 2018-05-01 16:52:21Z

0

Consider using pandas methods to iteratively check columns and run import instead of scanning first lines with csv. Also, use os to manage the file names extract and locations with shutil for moving done files. Below builds a list of dataframes for final concatenation outside loop.

import os, shutil
import pandas as pd

def import_csvs(csv_file):

    path = r'/path/to/csv/files'
    csv_files = sorted([f for f in os.listdir(path) if f[-3:] == 'csv'])

    # INITIALIZE DATAFRAME LIST
    df_list = []
    # READ FIRST DF (ASSUMED FIRST IN ALPHABETICAL ORDER)
    first_df = pd.read_csv(os.path.join(path, csv_files[0]))
    # APPEND FIRST DF   
    df_list.append(first_df)

    # MOVE FIRST CSV
    shutil.move(os.path.join(path, csv_files[0]), os.path.join(path,'done',csv_files[0]))

    # LOOP ALL OTHER CSVs SKIPPING FIRST
    for f in csv_files[1:]:                 
       # IMPORT CSV
       tmp = pd.read_csv(os.path.join(path, f))

       # CHECK DF COLUMNS EXACTLY MATCH
       if list(tmp.columns) == list(first_df.columns):  
          # APPEND DF TO LIST
          df_list.append(tmp)

          # MOVE COMPLETED FILE
          shutil.move(os.path.join(path, f), os.path.join(path, 'done', f))

    final_df = pd.concat(df_list)

    return final_df

answered May 1, 2018 at 16:52

Parfait

108k19 gold badges102 silver badges138 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

Pal S Over a year ago

Worked exactly as requirement. After exporting the final_df to a out.csv I see the header of last column says Unnamed: 11. Here are the coulmns in the file: SYMBOL SERIES OPEN HIGH LOW CLOSE LAST PREVCLOSE TOTTRDQTY TOTTRDVAL TIMESTAMP Unnamed: 11

Parfait Over a year ago

Check your data. Do you have unnamed column 11 in first csv file?

Pal S Over a year ago

I don't see any unnamed column in the data. I have uploaded the final out.csv as a zip file on the grdive folder in case you want to take a look at it drive.google.com/open?id=1_1jgRS2IrnHJ_ogHYyVuRzgEnmpuQave

Pal S Over a year ago

It seems unnamed:11 is the number of header that are there in the files. I opened up all 237 files manually & verified there is no other col. I also ran the code with smaller number of files 1) with 1 file & 2) with 4 files. Every time the out.csv had the unnamed:11 even with random files.

Pal S Over a year ago

I am using following to export out.csv <<<< --- final_df.to_csv(dir_path + 'out.csv', encoding='utf-8', index=False) --->>>>

|

Collectives™ on Stack Overflow

Python | Combine multiple csv (100+) files from one folder taking csv header into consideration

1 Answer 1

10 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Your Answer

Sign up or log in

Post as a guest

Related