Performance optimization (python): Speeding up .append() to pandas DataFrame

Asked 7 years, 5 months ago

Modified 7 years, 5 months ago

Viewed 2k times

I have a very large dataset in MongoDB, which is queried and appended row by row to a resulting DataFrame.

for tree in db.im_tree_active.find(
        {"date": {"$gte": startdate, "$lte": enddate},
         "depth": {"$gte": 1, "$lte": 4}},
        no_cursor_timeout=True).batch_size(1500):
    if count % 1000 == 0:
        print(count, tot)
    # keyFill(keylist, tree)  <-- added to compensate for mismatched columns
    # im = im.append(tree)    <-- ran too slowly
    im.loc[count, :] = tree   # <-- runs much faster, but keyFill() slows it down
    count += 1

Using pandas' .append() created a copy of the DataFrame on every call, which took far too long once the DataFrame grew large.

I replaced the append statement with a .loc[] assignment, which I read should speed things up a bit, but I then get a mismatched-column error. This happens because some of the trees returned from MongoDB lack fields that others have. I fixed this by adding a function keyFill(), given by the following simple code:

def keyFill(keylist, tree):
    for key in keylist:
        if key not in tree.keys():
            tree[key] = ""
    return tree

However, running this before every single .loc[] call slows the query down by nearly 1000% (estimated).

Is there a way to speed this whole process up? The query runs a lot quicker until it is about 50% of the way through the dataset, and then keeps slowing down, to the point that the last 1000 trees it appends take nearly 10x as long as the first 1000.

asked Jun 28, 2018 at 15:59

Jack Walsh

1

Why do you fill the DataFrame in a loop? Can't you just directly construct it from find-result? In general it is very slow to fill a DataFrame row by row

– CodeZero
Commented Jun 28, 2018 at 16:10
3

Constantly appending to a DataFrame within a loop is inefficient. Instead, you should append the DataFrames to a list within the loop and use a single pd.concat(list_of_dfs) after the loop

– ALollz
Commented Jun 28, 2018 at 16:11
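
The pattern described in this comment can be sketched as follows. This is only an illustration: the per-iteration frames and the column names `id` and `value` are made up, not taken from the question. It is also worth knowing that `DataFrame.append` was removed in pandas 2.0, so collecting pieces and concatenating once is the supported route on current versions.

```python
import pandas as pd

# Hypothetical data: each loop iteration produces one small DataFrame.
pieces = []
for i in range(3):
    pieces.append(pd.DataFrame({"id": [i], "value": [i * 10]}))

# One concat after the loop instead of one copy of the growing frame per row.
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Appending to a Python list is cheap, so the cost of building the final frame is paid once rather than on every iteration.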
2

@ALollz your method works very quickly. However, pd.concat() would not work because the trees are actually dictionaries, but simply putting im = pd.DataFrame(list_of_dicts) worked perfectly. Thanks! (I would mark your comment as the answer, but it's a comment lol)

– Jack Walsh
Commented Jun 28, 2018 at 16:29
@JackWalsh Glad it worked! I went for the comment because I'd just be providing a less detailed answer than one that already exists like: stackoverflow.com/questions/31674557/…

– ALollz
Commented Jun 28, 2018 at 16:32
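
For readers landing here, a minimal sketch of the approach that resolved the question: collect the dictionaries in a list and build the DataFrame once. The `fake_cursor()` generator is a hypothetical stand-in for the `db.im_tree_active.find(...)` cursor, and its field names are invented for the example. Because the constructor fills fields missing from a dictionary with NaN, a helper like keyFill() should not be needed; the `fillna("")` line is only there for anyone who wants the empty-string behaviour back.

```python
import pandas as pd


def fake_cursor():
    """Hypothetical stand-in for the MongoDB cursor: dicts with uneven keys."""
    yield {"_id": 1, "depth": 1, "name": "a"}
    yield {"_id": 2, "depth": 2}  # no "name" field
    yield {"_id": 3, "depth": 4, "name": "c", "extra": 7}


# Collect plain dicts in a list; no DataFrame is touched inside the loop.
list_of_dicts = [tree for tree in fake_cursor()]

# Build the DataFrame once. Keys absent from a given dict become NaN.
im = pd.DataFrame(list_of_dicts)

# Optional: mimic keyFill(), which used "" for absent fields.
im_filled = im.fillna("")

print(im)
```

With a real cursor, the list comprehension would simply iterate over the `find(...)` result instead of `fake_cursor()`.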
