3

I have a very large dataset in a mongoDB which is queried and appended to a resulting DataFrame.

for tree in db.im_tree_active.find({"date" : { '$gte' : startdate , 
'$lte' : enddate },"depth" : {'$gte' : 1, '$lte' : 4}, no_cursor_timeout = True).batch_size(1500):
    if count % 1000 == 0:
        print(count, tot)
    #keyFill(keylist, tree)  <-- added to compensate for mismatched columns
    #im = im.append(tree)  <-- ran too slowly
    im.loc[count, :] = tree  <-- runs much faster but keyFill() slows down
    count+=1

Using pandas .append() function created a copy of the dataframe, which took far too long when the DataFrame became much larger.

I replaced the append statement with a .loc[] statement which I read should speed up the query a bit, however I receive a mismatched column error. This is because some of the trees that are iterated through in the MongoDB don't have some of the fields which other ones do have. I fixed this by adding a function keyFill() given by the following simple code:

def keyFill(keylist, tree):
    for key in keylist:
        if key not in tree.keys():
            tree[key] = ""
    return tree

However running this before every single .loc[] call causes the query to slow down nearly 1000% (estimated).

Is there a way to speed this whole process up? The query runs a lot quicker before it reaches about 50% through the dataset, and then continues to slow down to the point that the last 1000 trees that it appends takes nearly 10x as long to run as the first 1000.

4
  • 1
    Why do you fill the DataFrame in a loop? Can't you just directly construct it from find-result? In general it is very slow to fill a DataFrame row by row Commented Jun 28, 2018 at 16:10
  • 3
    Constantly appending to a DataFrame within a loop is inefficient. Instead, you should append the DataFrames to a list within the loop and use a single pd.concat(list_of_dfs) after the loop Commented Jun 28, 2018 at 16:11
  • 2
    @ALollz your method works very quickly. However, pd.concat() would not work due to the trees actually being dictionaries, but simply putting im = pd.DataFrame(list_of_dicts) worked perfectly. Thanks! (I would mark your comment as answer but its a comment lol) Commented Jun 28, 2018 at 16:29
  • @JackWalsh Glad it worked! I went for the comment because I'd just be providing a less detailed answer than one that already exists like: stackoverflow.com/questions/31674557/… Commented Jun 28, 2018 at 16:32

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.