7

In the pandas documentation, it states:

It is worth noting however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

frames = [ process_your_file(f) for f in files ]

result = pd.concat(frames)

My current situation is that I will be concatenating a new dataframe to a growing list of data frames over and over. This will result in a horrifying number of concatenations.

I'm worried about performance, and I'm not sure how to make use of list comprehension in this case. My code is as follows.

df = first_data_frame
while verify:
    # download data (new data becomes available through each iteration)
    # then turn [new] data into data frame, called 'temp'
    frames = [df, temp]
    df = pd.concat(frames)
    if condition_met:
        verify = False

I don't think the parts that download data and create the data frame are relevant; my concern is with the constant concatenation.

How do I implement list comprehension in this case?

2 Answers

5

List comprehension is very fast and elegant. I also had to chain together many different dataframes from a list. This is my code:

import os
import pandas as pd
import numpy as np

# FileNames is a list with the names of the csv files contained in the 'dataset' path

FileNames = []
for files in os.listdir("dataset"):
    if files.endswith(".csv"):
        FileNames.append(files)

# function that reads the file from the FileNames list and makes it become a dataFrame

def GetFile(fnombre):
    location = 'dataset/' + fnombre
    df = pd.read_csv(location)
    return df

# list comprehension
df = [GetFile(file) for file in FileNames]
dftot = pd.concat(df)

The result is a DataFrame of over one million rows (8 columns), created in 3 seconds on my i3.

If you replace the two "list comprehension" lines with the following, you will notice a deterioration in performance:

dftot = pd.DataFrame()
for file in FileNames:
    df = GetFile(file)
    dftot = pd.concat([dftot, df])

To add an 'if' condition to your code, change the line:

df = [GetFile(file) for file in FileNames]

to, for example:

df = [GetFile(file) for file in FileNames if file == 'A.csv']

This code reads only the 'A.csv' file.



4

If you have a loop that can't be put into a list comprehension (like a while loop), you can initialize an empty list at the top, then append to it during the while loop. Example:

frames = []
while verify:
    # download data
    # temp = pd.DataFrame(data)
    frames.append(temp)
    if condition_met:
        verify = False

pd.concat(frames)

You can also put the loop in a generator function, and then use a list comprehension, but that might be more complicated than you need.
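As a minimal sketch of that generator idea (the download step here is a placeholder producing a tiny DataFrame, and the stop condition is made up):

```python
import pandas as pd

def downloaded_frames():
    """Yield one DataFrame per iteration until the stop condition is met."""
    verify = True
    batch = 0
    while verify:
        # placeholder for the real download + DataFrame-building step
        temp = pd.DataFrame({"batch": [batch] * 3})
        yield temp
        batch += 1
        if batch == 4:  # stand-in for condition_met
            verify = False

# pd.concat accepts any iterable of frames, so the generator
# can be passed directly without building an intermediate list
result = pd.concat(downloaded_frames(), ignore_index=True)
```

This keeps each downloaded frame around only until it is consumed, and still does a single concat at the end.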

Also, if your data comes naturally as a list of dicts or something like that, you may not need to create all the temporary dataframes - just append all of your data into one giant list of dicts, and then convert that to a dataframe in one call at the very end.
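A sketch of that last approach, with an invented loop and field names standing in for the real download:

```python
import pandas as pd

rows = []
for batch_num in range(3):  # stand-in for the download loop
    # pretend each download yields a list of dicts
    batch = [{"batch": batch_num, "value": v} for v in range(2)]
    rows.extend(batch)

# one DataFrame construction at the very end, no intermediate frames
df = pd.DataFrame(rows)
```

Appending to a plain Python list is cheap, so all the cost is paid once in the final `pd.DataFrame(rows)` call.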

