0

I'm new to Python and would like to extract rows from several CSV (preferably TSV) files into one new Excel file, with a new column identifying the source of the data.

My code for a single file is:

import pandas as pd

df = pd.read_csv('C:/Users/filename.tsv', names=['c1', 'c2', 'c3', 'c4'], delimiter='\t')

result = df.loc[df['c2'].isin(['name'])]

result.to_excel(r'C:/Users/filenamenew.xlsx')

But how do I do it for several files, such as filename1.tsv, filename2.tsv, filename3.tsv...?

3
  • You can use glob or simply a for loop iterating over the names of your files. Commented Sep 23, 2022 at 12:50
  • Thanks for the comment. I had the same idea, but how do I iterate over the filenames? Commented Sep 23, 2022 at 13:01
  • Welcome to SO. The code fails to run because result_curr is not defined (among other things). Please try to get it to work for a single file first. Then we can help you with looping over multiple files. Commented Sep 23, 2022 at 13:11

2 Answers

1

You can iterate over the files in a for loop: read each file into a dataframe, add a new column containing the source file name, and append the dataframe to a list. At the end, use pd.concat() to concatenate all the dataframes into a single one and save it as an Excel sheet.

import pandas as pd

filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv", ...]

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    dataframes.append(df)

pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")

If you need to filter the rows of each dataframe, you can do it before appending the dataframe to the list:

import pandas as pd

filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv", ...]

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    df = df.loc[(df['c2'].isin(['name']))]  # here you can filter
    dataframes.append(df)

pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")

10 Comments

Thanks for the answer. I still have a question: how do I combine it with taking just a few rows out of each file? If I concatenate all the files into one big dataframe and only then use df.loc, the dataframe gets too big.
You can filter each dataframe right before appending it to the list! I will add a code example to my answer.
So easy, when I see it!! ;) Thanks a lot.
Actually, I tried to combine it with @Liutprand's answer so that I don't have to write the filenames manually. So I wrote filenames = os.listdir("C:/") filenames = [f for f in filenames if f.endswith("*.tsv.gz")] dataframes = [] for filename in filenames: df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t") df["filename"] = filename df = df.loc[(df['c2'].isin(['name']))] dataframes.append(df) pd.concat(dataframes).to_excel(r"C:/new.xlsx") and it's not working anymore: ValueError: No objects to concatenate
How can I write code in the comments?
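
The ValueError reported in the comment above is consistent with the filename filter producing an empty list. str.endswith compares literally and does not expand wildcards, so no file name ends with the characters "*.tsv.gz", and pd.concat() raises when it is given an empty list. A minimal sketch of the difference, using made-up file names that stand in for the real directory listing (they are assumptions, not taken from the question):

```python
import fnmatch

# Hypothetical directory listing, standing in for os.listdir(...)
names = ["a.tsv.gz", "b.tsv.gz", "notes.txt"]

# str.endswith does a literal comparison, so the "*" never matches
literal = [f for f in names if f.endswith("*.tsv.gz")]

# Dropping the wildcard gives the intended suffix test
by_suffix = [f for f in names if f.endswith(".tsv.gz")]

# fnmatch is the standard-library way to apply a shell-style pattern
by_pattern = fnmatch.filter(names, "*.tsv.gz")

print(literal)     # []
print(by_suffix)   # ['a.tsv.gz', 'b.tsv.gz']
print(by_pattern)  # ['a.tsv.gz', 'b.tsv.gz']
```

Note also that os.listdir() returns bare file names, so each one has to be joined with the folder path before it is passed to pd.read_csv(), unless the script runs from inside that folder.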
0

Assuming you know the names of the TSV files in advance, you can put them in a list, loop over it, and use pd.concat() to append each file's dataframe to the final one.

import pandas as pd

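input_files = ["filename1.tsv", "filename2.tsv", "filename3.tsv"]
col = ["c1", "c2", "c3", "c4"]

final_df = pd.DataFrame(columns=col)

for i in input_files:
    # read_csv takes the column names via `names`; it has no `columns` argument
    df = pd.read_csv(i, delimiter="\t", names=col)
    df["source"] = i  # record which file each row came from
    final_df = pd.concat([final_df, df])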

final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)

If you don't want to write the filenames in the list manually, you can grab them from a folder using the os module, like this:

import pandas as pd
import os

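folder = "C:/Path/To/The/Folder"
# os.listdir returns bare file names, so join each one with the folder path
input_files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith(".tsv")]  # tsv files only
col = ["c1", "c2", "c3", "c4"]

final_df = pd.DataFrame(columns=col)

for i in input_files:
    # read_csv takes the column names via `names`; it has no `columns` argument
    df = pd.read_csv(i, delimiter="\t", names=col)
    df["source"] = i  # record which file each row came from
    final_df = pd.concat([final_df, df])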

final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)
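
As an alternative to os.listdir, the glob module suggested in the comments on the question returns full paths that match a pattern. The sketch below combines it with the filter-before-append idea from the other answer and guards against the case where no file matches. The function name collect_rows and its parameters folder, pattern and value are hypothetical, chosen only for illustration; the column names and the 'name' filter value are the placeholders from the question.

```python
import glob
import os

import pandas as pd


def collect_rows(folder, pattern="*.tsv", value="name"):
    """Read every file in `folder` matching `pattern`; keep rows whose c2 equals `value`."""
    cols = ["c1", "c2", "c3", "c4"]
    frames = []
    # glob.glob returns the joined paths, so no manual os.path.join per file is needed
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        df = pd.read_csv(path, names=cols, delimiter="\t")
        df = df.loc[df["c2"].isin([value])].copy()  # filter before collecting
        df["source"] = os.path.basename(path)  # record the source file name
        frames.append(df)
    if not frames:
        # pd.concat raises "No objects to concatenate" on an empty list
        return pd.DataFrame(columns=cols + ["source"])
    return pd.concat(frames, ignore_index=True)
```

The result could then be written out as in the answers above, for example with collect_rows("C:/Path/To/The/Folder").to_excel("C:/Users/filenamenew.xlsx", index=False). Collecting the filtered frames in a list and concatenating once avoids copying the growing dataframe on every iteration.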
