Extracting conditional data from multiple csv files

Question

I'm new to Python and would like to extract rows from several CSV (or better, TSV) files into one new Excel file, with a new column identifying the source of each row.

My code for doing it just for one file is:

import pandas as pd

df = pd.read_csv('C:/Users/filename.tsv', names=['c1', 'c2', 'c3', 'c4'], delimiter='\t')

result = df.loc[df['c2'].isin(['name'])]

result.to_excel(r'C:/Users/filenamenew.xlsx')

But how do I do it for several files, such as filename1.tsv, filename2.tsv, filename3.tsv, ...?

You can use glob or simply a for loop iterating over the names of your files.
– Luis Alejandro Vargas Ramos, Commented Sep 23, 2022 at 12:50
Thanks for the comment. I had the same idea, but how do I iterate over filenames?
– Chras, Commented Sep 23, 2022 at 13:01
Welcome to SO. The code fails to run because result_curr is not defined (among other things). Please try to get it to work for a single file first. Then we can help you with looping over multiple files.
– C. Pappy, Commented Sep 23, 2022 at 13:11
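As a minimal sketch of the glob suggestion in the comments: the helper below collects the full paths of every .tsv file in one folder so they can be looped over. The function name list_tsv_files and the folder "C:/Users" are assumptions made for illustration, not part of the original question.

```python
import glob
import os


def list_tsv_files(folder):
    """Return the full paths of all .tsv files in `folder`, sorted by name."""
    pattern = os.path.join(folder, "*.tsv")  # the "*" wildcard is expanded by glob
    return sorted(glob.glob(pattern))


# Hypothetical folder; replace it with the directory that holds your files.
for path in list_tsv_files("C:/Users"):
    print(path)
```

Because glob returns each match with the folder prefix attached, the paths can be passed straight to pd.read_csv.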

Matteo Zanoni · Accepted Answer · 2022-09-23 13:03:10Z

You can iterate over the files in a for loop: read each file into a dataframe, add a new column containing the source file name, and append the dataframe to a list. At the end, use pd.concat() to concatenate all the dataframes into a single one and save it as an Excel sheet.

import pandas as pd

filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv", ...]

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    dataframes.append(df)

pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")

If you need to filter which rows to keep from each dataframe, you can do it before appending the dataframe to the list:

import pandas as pd

filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv", ...]

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    df = df.loc[(df['c2'].isin(['name']))]  # here you can filter
    dataframes.append(df)

pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")

edited Sep 23, 2022 at 13:03

answered Sep 23, 2022 at 12:51

Matteo Zanoni


10 Comments

Chras Over a year ago

Thanks for the answer. I still have a question: how do I combine it with taking just a few rows out of each file? If I concatenate all the files into one big dataframe and use df.loc afterwards, the dataframe gets too big.

Matteo Zanoni Over a year ago

You can filter each dataframe right before appending it to the list! I will add a code example to my answer.

Chras Over a year ago

So easy, when I see it!! ;) Thanks a lot.

Chras Over a year ago

Actually, I tried to combine it with the answer of @Liutprand so that I would not have to write the filenames manually. So I wrote:

filenames = os.listdir("C:/")
filenames = [f for f in filenames if f.endswith("*.tsv.gz")]
dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    df = df.loc[(df['c2'].isin(['name']))]
    dataframes.append(df)
pd.concat(dataframes).to_excel(r"C:/new.xlsx")

and it's not working anymore: ValueError: No objects to concatenate
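A likely cause of that error, offered as a sketch rather than a confirmed diagnosis: str.endswith compares a literal suffix, so the "*" in "*.tsv.gz" is not treated as a wildcard, no filename matches, and pd.concat receives an empty list. Separately, os.listdir returns bare names without the folder, so they need to be joined back onto the path before reading. The helper name find_files and the sample filename data1.tsv.gz are made up for illustration.

```python
import fnmatch
import os


def find_files(folder, suffix=".tsv.gz"):
    """Return full paths of the files in `folder` whose names end with `suffix`."""
    return sorted(
        os.path.join(folder, name)  # os.listdir yields bare names, so re-attach the folder
        for name in os.listdir(folder)
        if name.endswith(suffix)    # literal suffix, no "*"
    )


# str.endswith treats "*" as an ordinary character, not as a wildcard.
print("data1.tsv.gz".endswith("*.tsv.gz"))          # False
print("data1.tsv.gz".endswith(".tsv.gz"))           # True
# Wildcard patterns are handled by fnmatch (and glob) instead.
print(fnmatch.fnmatch("data1.tsv.gz", "*.tsv.gz"))  # True
```

Checking that the resulting list is not empty before calling pd.concat gives a clearer failure than "No objects to concatenate".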

Chras Over a year ago

How can I write code in the comments?


Liutprand · Accepted Answer · 2022-09-23 13:00:15Z

Assuming you know the names of the TSV files in advance, you can put them in a list, loop over it, and use pd.concat() to append each file's rows to the final dataframe.

import pandas as pd

input_files = ["filename1.tsv", "filename2.tsv", "filename3.tsv"]
col = ["c1", "c2", "c3", "c4"]

final_df = pd.DataFrame(columns=col)

for i in input_files:
    # read_csv takes the column names via names=; it has no columns= parameter
    df = pd.read_csv(i, delimiter="\t", names=col)
    df["source"] = i
    final_df = pd.concat([final_df, df])

final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)

If you don't want to write the filenames in the list manually, you can grab them from a folder using the os module, like this:

import pandas as pd
import os

folder = "C:/Path/To/The/Folder"
input_files = os.listdir(folder)
input_files = [f for f in input_files if f.endswith(".tsv")]  # filter for tsv files only
col = ["c1", "c2", "c3", "c4"]

final_df = pd.DataFrame(columns=col)

for i in input_files:
    # os.listdir returns bare names, so join them with the folder path;
    # read_csv takes the column names via names=, not columns=
    df = pd.read_csv(os.path.join(folder, i), delimiter="\t", names=col)
    df["source"] = i
    final_df = pd.concat([final_df, df])

final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)
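One possible variant, sketched under assumptions: it finds the files with pathlib, filters each dataframe as soon as it is read, and collects the pieces in a list so that pd.concat runs only once instead of once per file. The function name combine_tsv and its parameters are hypothetical. The column names c1 to c4 and the filter value "name" are taken from the question.

```python
from pathlib import Path

import pandas as pd


def combine_tsv(folder, wanted=("name",), columns=("c1", "c2", "c3", "c4")):
    """Read every .tsv in `folder`, keep rows whose c2 is in `wanted`, tag the source file."""
    frames = []
    for path in sorted(Path(folder).glob("*.tsv")):
        df = pd.read_csv(path, sep="\t", names=list(columns))
        df = df[df["c2"].isin(list(wanted))].copy()  # filter early to keep memory use low
        df["source"] = path.name                     # new column naming the source file
        frames.append(df)
    if not frames:
        # pd.concat raises ValueError on an empty list; fail with a clearer message
        raise FileNotFoundError(f"no .tsv files found in {folder}")
    return pd.concat(frames, ignore_index=True)
```

The result can then be written with combine_tsv("C:/Path/To/The/Folder").to_excel("C:/Users/filenamenew.xlsx", index=False), which requires an Excel writer engine such as openpyxl to be installed.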
