0

Requirement: I have a folder with multiple csv files. I need to perform following:

  1. Scan the input folder for all csv files (file1.csv, file2.csv ..... filen.csv etc) & perform below steps
  2. Open the first csv file (file1.csv) & store the file header in a list & then copy the entire content (including header) of file1.csv & put in combined.csv => Then move file1.csv to a /done folder under the parent folder
  3. Move to next file file2.csv => Compare the header of file2.csv with the header of the first file to ensure they match exactly. If the headers match then copy contents (excluding header) of file2.csv & put in combined.csv => Then move file2.csv to a /done folder under the parent folder. If the header of file2.csv do not match then exclude combining this file, leave it in the same parent folder & move to next file for combining

I have placed sample files on the link gdrive folder with sample CSV files I am open to any solution either with CSV or pandas as long as it does what I want

As a starting point, I was initially working on comparing the header as below which works. However, I am unsure on how to move further

Code:

    import csv        
    def compare_two_csv_headers(csv_file1, csv_file2):
        with open(csv_file1, newline='') as f:
            reader = csv.reader(f)
            frow1 = next(reader)  # gets the first line
            print(frow1)
        with open(csv_file2, newline='') as f:
            reader = csv.reader(f)
            frow2 = next(reader)  # gets the first line
            print(frow2)
        if frow1==frow2:
            print('Same header')
        else:
            print('Different header')

    csv_file1 = 'D:/2009/cm01JAN2009bhav.csv'
    csv_file2 = 'D:/2009/cm01DEC2009bhav.csv'
    compare_two_csv_headers(csv_file1, csv_file2)

Here are the top 10 lines from the first csv file

SYMBOL,SERIES,OPEN,HIGH,LOW,CLOSE,LAST,PREVCLOSE,TOTTRDQTY,TOTTRDVAL,TIMESTAMP,
20MICRONS,EQ,46.5,47,45.7,46.05,46,46.55,7092,328975.25,31-DEC-2009,
3IINFOTECH,EQ,85.8,86.7,84.5,85.15,85.35,85.05,2423812,207760480.3,31-DEC-2009,
3MINDIA,EQ,1855.05,1879.9,1855.05,1865.75,1874.95,1850.45,85,158679.1,31-DEC-2009,
AARTIDRUGS,EQ,107.4,108.75,103.65,104.45,104.9,106.05,84012,8929759.4,31-DEC-2009,
AARTIIND,EQ,51,51.9,48.9,49.2,49.1,50.45,149365,7517110.3,31-DEC-2009,
AARVEEDEN,EQ,64,64.5,63.05,63.85,63.1,62.7,2172,138651.5,31-DEC-2009,
ABAN,EQ,1265,1297,1265,1283.65,1283.2,1260.05,1381290,1773221519.75,31-DEC-2009,
ABB,EQ,756.2,770.85,756.2,767.1,769.55,756.3,292376,223660807.4,31-DEC-2009,
ABCIL,EQ,85.4,89,84.9,86.85,86.95,84.7,59183,5170993.2,31-DEC-2009,

1 Answer 1

0

Consider using pandas methods to iteratively check columns and run import instead of scanning first lines with csv. Also, use os to manage the file names extract and locations with shutil for moving done files. Below builds a list of dataframes for final concatenation outside loop.

import os, shutil
import pandas as pd

def import_csvs(csv_file):

    path = r'/path/to/csv/files'
    csv_files = sorted([f for f in os.listdir(path) if f[-3:] == 'csv'])

    # INITIALIZE DATAFRAME LIST
    df_list = []
    # READ FIRST DF (ASSUMED FIRST IN ALPHABETICAL ORDER)
    first_df = pd.read_csv(os.path.join(path, csv_files[0]))
    # APPEND FIRST DF   
    df_list.append(first_df)

    # MOVE FIRST CSV
    shutil.move(os.path.join(path, csv_files[0]), os.path.join(path,'done',csv_files[0]))

    # LOOP ALL OTHER CSVs SKIPPING FIRST
    for f in csv_files[1:]:                 
       # IMPORT CSV
       tmp = pd.read_csv(os.path.join(path, f))

       # CHECK DF COLUMNS EXACTLY MATCH
       if list(tmp.columns) == list(first_df.columns):  
          # APPEND DF TO LIST
          df_list.append(tmp)

          # MOVE COMPLETED FILE
          shutil.move(os.path.join(path, f), os.path.join(path, 'done', f))

    final_df = pd.concat(df_list)

    return final_df
Sign up to request clarification or add additional context in comments.

10 Comments

Worked exactly as requirement. After exporting the final_df to a out.csv I see the header of last column says Unnamed: 11. Here are the coulmns in the file: SYMBOL SERIES OPEN HIGH LOW CLOSE LAST PREVCLOSE TOTTRDQTY TOTTRDVAL TIMESTAMP Unnamed: 11
Check your data. Do you have unnamed column 11 in first csv file?
I don't see any unnamed column in the data. I have uploaded the final out.csv as a zip file on the grdive folder in case you want to take a look at it drive.google.com/open?id=1_1jgRS2IrnHJ_ogHYyVuRzgEnmpuQave
It seems unnamed:11 is the number of header that are there in the files. I opened up all 237 files manually & verified there is no other col. I also ran the code with smaller number of files 1) with 1 file & 2) with 4 files. Every time the out.csv had the unnamed:11 even with random files.
I am using following to export out.csv <<<< --- final_df.to_csv(dir_path + 'out.csv', encoding='utf-8', index=False) --->>>>
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.