I have 3 collections in MongoDB that cannot have their schema changed. Some queries need to access the 3 collections.

I know that I need multiple queries to do this, but I'm not sure what the most efficient method is. The following example is simplified:

My data contains a "User" collection that serves as a logical parent to the other two collections, "DVD" and "CD". A user can have multiple CDs or DVDs.

User Document
{ id: "jim", location: "sweden" }

CD Document
{ name: "White Album", owner: "jim" }

DVD Document
{ name: "Fargo", owner: "jim" }

Now, the approach I am currently taking is as follows. Say I want to get back all of the CDs and DVDs for users in Sweden.

Step 1

Get all users in Sweden and return a cursor

Step 2

Iterate through each user in the cursor and perform a lookup on both the DVD and CD collections to see if the user's id matches the owner field

Step 3

If it does, add the user to an array to be returned

This approach requires 2 additional queries per user and seems really inefficient to me. Is there a more efficient way of doing this?
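The three steps above can be sketched in plain JavaScript. The collections are stubbed here as in-memory arrays (standing in for `db.user`, `db.cd`, and `db.dvd`) so the per-user lookup cost is visible; the field and collection names follow the question:

```javascript
// Stubbed collections standing in for db.user, db.cd, and db.dvd
const users = [
  { id: "jim", location: "sweden" },
  { id: "tom", location: "norway" },
];
const cds  = [{ name: "White Album", owner: "jim" }];
const dvds = [{ name: "Fargo", owner: "jim" }];

// Step 1: get all users in Sweden (one query)
const swedes = users.filter(u => u.location === "sweden");

// Steps 2-3: two extra lookups per user, keeping users that own a CD or DVD
const result = swedes.filter(u =>
  cds.some(c => c.owner === u.id) || dvds.some(d => d.owner === u.id)
);

console.log(result.map(u => u.id)); // [ 'jim' ]
```

Against a real database, each iteration of that filter would be two round trips, which is the N+1 pattern the question is worried about.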

  • Why are CD and DVD not in the same collection? Commented Nov 21, 2013 at 12:20
  • @Phillip It's just a simplified example Commented Nov 21, 2013 at 12:22

3 Answers

2

You can make some improvements on the query as follows.

  • While selecting users, return only the id field.

db.user.find({location:"sweden"},{id:1})

  • Create a list of the returned user ids and pass that list to an $in query. Run the $in query on the cd and dvd collections as follows:
db.cd.find({owner : {$in : ["jim", "tom", ...]}})
db.dvd.find({owner : {$in : ["jim", "tom", ...]}})

Also, add indexes on the queried fields to improve query performance.
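A sketch of this two-step approach in plain JavaScript, with the result of the first `find` stubbed as an array (field and collection names are taken from the question):

```javascript
// Result of db.user.find({location: "sweden"}, {id: 1}) — stubbed here
const swedishUsers = [{ id: "jim" }, { id: "tom" }];

// Build the owner list once, then reuse it in a single $in filter per collection
const ownerIds = swedishUsers.map(u => u.id);
const filter = { owner: { $in: ownerIds } };

// Shell equivalents of the two follow-up queries:
//   db.cd.find({ owner: { $in: ownerIds } })
//   db.dvd.find({ owner: { $in: ownerIds } })
console.log(filter); // { owner: { '$in': [ 'jim', 'tom' ] } }
```

This turns the two-lookups-per-user loop into exactly two follow-up queries, one per collection.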


4 Comments

I've tried this approach before but was worried, as the Users collection could contain a very large number of users. So the array being used could contain 1 million+ users. Would an array this size make the query fall over?
AFAIK, there is no limit on the size of the array passed to the $in operator. The only limit here is the BSON document size (16 MB). If you have an array of 1M ids, you can run the same query 10 times, passing 100K into each $in query. This is still better than running 1M queries.
Hmm, projection won't make the query more efficient, it will only shrink the amount of data on return. Also, what do you mean by a string list of user names? How does that work?
I know projection does not make the query more efficient, but here I try to mention best practices when writing queries. Also, by string list I mean creating a list of the user names and sending that list into $in as a parameter. This doesn't make any improvement on the query itself; I'm just trying to explain how to write it.
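The batching suggested in the comments above can be sketched with a small chunking helper (plain JavaScript; the batch size and variable names are illustrative, not from the original):

```javascript
// Split a large id list into fixed-size batches so each $in query document
// stays well under the 16 MB BSON limit (the batch size here is arbitrary)
function chunk(ids, size) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// e.g. 10 queries of 100,000 ids instead of 1,000,000 single-user queries:
//   chunk(allOwnerIds, 100000).forEach(batch =>
//     db.cd.find({ owner: { $in: batch } }) /* ...collect results... */);
console.log(chunk(["a", "b", "c", "d", "e"], 2)); // [['a','b'],['c','d'],['e']]
```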
0

It isn't as inefficient as it sounds.

You are most likely thinking of SQL techs whereby a result set is made each time you query and that is in turn cached on disk or in memory.

MongoDB streams every cursor batch directly from the data files, which means its data is "live" from the database, unlike a result set. This also means that firing the odd extra query is a lot less resource intensive.

One option is, as you said, to bring back all users and judge on each iteration whether they should be displayed because they have related records. This could evenly distribute the cursors and avoid overloading; however, there is still the possibility of cursor overload on the server.

One other option is to iterate all users from Sweden and get back a huge user_id array with which to query the CD and DVD collections. From there you would match them up in your application and return as needed.
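That match-up-in-the-application step might look like the following sketch, with both query results stubbed as in-memory arrays (names follow the question's documents):

```javascript
// Stubbed query results: users from Sweden, then one $in query per collection
const swedes = [{ id: "jim" }, { id: "tom" }];
const cds  = [{ name: "White Album", owner: "jim" }];
const dvds = [{ name: "Fargo", owner: "jim" }];

// Join the media back onto each user client-side
const joined = swedes.map(u => ({
  id: u.id,
  cds:  cds.filter(c => c.owner === u.id).map(c => c.name),
  dvds: dvds.filter(d => d.owner === u.id).map(d => d.name),
}));

console.log(joined[0]); // { id: 'jim', cds: [ 'White Album' ], dvds: [ 'Fargo' ] }
```

The trade-off is memory in the application: the whole user_id array and both result sets have to fit client-side.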

However, exactly how you solve this is up to your scenario and how much data you have.


0

If you can't change your schema and you want to know which users from Sweden have a CD or DVD, then I think this is the smallest method:

  • user_ids = get all owner ids from the DVD and CD collections.
  • Get all users whose id is in user_ids and who are from Sweden.
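A sketch of this reversed order in plain JavaScript, with the owner ids stubbed as arrays (against a real database they could come from something like `db.cd.distinct("owner")`; the extra owner names are illustrative):

```javascript
// Stubbed owner ids pulled from the CD and DVD collections
const cdOwners  = ["jim", "anna"];
const dvdOwners = ["jim", "tom"];

// Query 1: collect every distinct owner id, deduplicated client-side
const ownerIds = [...new Set([...cdOwners, ...dvdOwners])];

// Query 2 (shell equivalent):
//   db.user.find({ id: { $in: ownerIds }, location: "sweden" })
const filter = { id: { $in: ownerIds }, location: "sweden" };

console.log(ownerIds); // [ 'jim', 'anna', 'tom' ]
```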

Then you are left with only 2 queries, but if your DVD and CD collections are big enough, this probably won't be faster than your method, even though it uses only 2 queries.

Keep in mind that lower number of queries doesn't mean necessarily faster.

Sorry for the english ;)

