
This is a continuation of this question.

I'm using the following code to find all documents from collection C_a whose text contains the word StackOverflow and store them in another collection called C_b:

from pymongo import MongoClient

client = MongoClient('127.0.0.1')   # mongodb running locally
dbRead = client['C_a']              # database that holds the source collection
# build the aggregation pipeline (all attributes and operators must be quoted in pymongo)
pipeline = [{"$match": {"$text": {"$search": "StackOverflow"}}}, {"$out": "C_b"}]
dbRead.C_a.aggregate(pipeline)      # run the aggregation
print(dbRead.C_b.count_documents({}))  # verify the size of the new collection (count() is deprecated in newer pymongo)

This works great. However, if I run the same snippet for multiple keywords, the results get overwritten. For example, I want the collection C_b to contain all documents that contain the keywords StackOverflow, StackExchange, and Programming. To do so I simply iterate the snippet using the above keywords. But unfortunately, each iteration overwrites the previous one.

Question: How do I update the output collection instead of overwriting it?

Plus: Is there a clever way to avoid duplicates, or do I have to check for duplicates afterwards?
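On the duplicates question: since every MongoDB document carries a unique `_id`, one client-side option is to merge the per-keyword result sets in Python and keep a single copy per `_id` before inserting into C_b. This is a minimal sketch in plain Python (the `dedupe_by_id` helper and the sample batches are hypothetical, standing in for the lists returned by each keyword's aggregation):

```python
def dedupe_by_id(docs):
    """Merge documents from several result sets, keeping one copy per _id."""
    seen = {}
    for doc in docs:
        seen.setdefault(doc["_id"], doc)  # first occurrence wins
    return list(seen.values())

# documents matching different keywords may overlap
batch_a = [{"_id": 1, "text": "StackOverflow"}, {"_id": 2, "text": "StackExchange"}]
batch_b = [{"_id": 2, "text": "StackExchange"}, {"_id": 3, "text": "Programming"}]

merged = dedupe_by_id(batch_a + batch_b)
print(len(merged))  # -> 3, the shared _id 2 is kept only once
```

The merged list can then be passed to a single `insert_many` call.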

3 Comments

  • $out will overwrite the collection if it exists. Why do you need to create new collections? Why can't the requirement be satisfied by querying the original collection instead? Commented May 29, 2018 at 23:40
  • @KevinAdistambha The above is a toy example. In truth, I have a very large collection of documents from which I want to extract all documents containing a keyword from a list of keywords (more than 200) and study them along various axes. To do so I want to create a collection with these specific documents. Is there no way of doing such a thing? Commented May 30, 2018 at 10:55
  • The nice "actual MongoDB employee" pointed you directly to the documentation that tells you that your "ask" is not possible. The only options are A. New collection using $out. B. Iterate results on a returned cursor and write updates back. Where of course B means transferring results and updates back "over the wire", which seems like exactly what you are trying to avoid. You should have paid attention to the very clear lesson. Commented Jun 1, 2018 at 12:24

1 Answer


If you look at the documentation, $out doesn't support updating an existing collection:

https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out

So you need to do a two-stage operation:

pipeline = [{"$match": {"$text": {"$search": "StackOverflow"}}}, {"$out": "temp"}]  # write the matches to a temporary collection
dbRead.C_a.aggregate(pipeline)

and then use the approach discussed in

https://stackoverflow.com/a/37433640/2830850

dbRead.C_b.insert_many(
    list(dbRead.temp.aggregate([]))  # aggregate() returns a cursor; materialize it with list() (pymongo has no toArray())
)

And before starting the run you will need to drop the C_b collection.
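Worth noting for readers on MongoDB 4.2 or later (released after this question was asked): the $merge aggregation stage was added for exactly this use case. Unlike $out it writes into an existing collection without replacing it, and its whenMatched option controls what happens to duplicates. A sketch of a per-keyword pipeline builder, with the live `aggregate` call commented out since it needs a running mongod (`dbRead` as in the question):

```python
def build_pipeline(keyword, target="C_b"):
    """Pipeline that upserts text-search matches into `target` (MongoDB 4.2+)."""
    return [
        {"$match": {"$text": {"$search": keyword}}},
        # $merge matches on _id by default: documents already in the target are
        # kept as-is, new ones are inserted, so repeated runs cannot duplicate
        {"$merge": {"into": target,
                    "whenMatched": "keepExisting",
                    "whenNotMatched": "insert"}},
    ]

for kw in ["StackOverflow", "StackExchange", "Programming"]:
    pipeline = build_pipeline(kw)
    # dbRead.C_a.aggregate(pipeline)  # run each keyword against a live server

print(build_pipeline("StackOverflow")[1]["$merge"]["into"])  # -> C_b
```

This removes both the overwriting problem and the need for a client-side duplicate check.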


5 Comments

So the whole point of $out is to avoid the 16MB BSON limit. You propose to then read that whole collection into an insert() which also has that same 16MB limit. That's not going to work in any practical situation. Also that's not an "update" anyway.
Then only other way would be to somehow update your aggregation to handle multiple values instead of doing it one step at a time
The point is that this is wrong; hence the comment, to let the poor person who did not understand the very clear documentation know that this is indeed an incorrect answer.
@TarunLalwani dbRead.C_b.insert(dbRead.temp.aggregate([]).toArray()) raises an AttributeError: 'CommandCursor' object has no attribute 'toArray' error.
try dbRead.C_b.insert(list(dbRead.temp.aggregate([])))
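On the comment above about handling multiple values in one aggregation: MongoDB's $text operator treats a space-separated search string as an OR of the individual terms, so all three keywords can go into a single $match followed by one $out, and duplicates never arise. A sketch (the `aggregate` call is commented out since it needs a live server; `dbRead` as in the question):

```python
keywords = ["StackOverflow", "StackExchange", "Programming"]

# $text with space-separated terms matches documents containing ANY of them,
# so a single aggregation covers the whole keyword list in one pass
pipeline = [
    {"$match": {"$text": {"$search": " ".join(keywords)}}},
    {"$out": "C_b"},
]
# dbRead.C_a.aggregate(pipeline)  # run against a live mongod

print(pipeline[0]["$match"]["$text"]["$search"])  # -> StackOverflow StackExchange Programming
```

For a list of 200+ keywords, this also means one server-side pass instead of 200 separate aggregations.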
