13

In my mongoDB book collection I have documents structured as follow :

/* 0 */
{
  "_id" : ObjectId("50485b89b30f1ea69110ff4c"),

  "publisher" : {
    "$ref" : "boohya",
    "$id" : "foo"
  },
  "displayName" : "Paris Nightlife",
  "catalogDescription" : "Some desc goes here",
  "languageCode" : "en",
  "rating" : 0,
  "status" : "LIVE",
  "thumbnailId" : ObjectId("50485b89b30f1ea69110ff4b"),
  "indexTokens" : ["Nightlife", "Paris"]
}

I perform the following regex query to find all documents having one indexToken starting with "Par" :

{ "indexTokens" : { "$regex" : "^Par" , "$options" : "i"}}

If I select only the indexTokens field to be returned like this :

{ "indexTokens" : 1}

The resulting DBObject is

{ "_id" : { "$oid" : "50485b89b30f1ea69110ff4c"} , "indexTokens" : [ "Nightlife" , "Paris"]}

What I would like to get is ONLY the token / tag that matched the regex (I don0t care about retrieving the document at this point, neither do I need all the tags of the matched document)

Is this a case for the new Aggregation Framework relesed under MongoDB v2.2. ?

If yes how do I modify my query so that the actual result would look like :

{ "indexTokens" : ["Paris", "Paradise River", "Parma" , etc ....]}

Bonus question (do you has teh codez) : How do I do it using the Java driver ?

For now my java looks like :

DBObject query = new BasicDBObject("indexTokens", java.util.regex.Pattern.compile("^"+filter+"", Pattern.CASE_INSENSITIVE));
    BasicDBObject fields = new BasicDBObject("indexTokens",1);
    DBCursor curs = getCollection()
                    .find(query, fields)
                    .sort( new BasicDBObject( "indexTokens" , 1 ))
                    .limit(maxSuggestionCount);

Thx :)

EDIT:

As per your answers I modified my JAVA code as follow :

BasicDBObject cmdBody = new BasicDBObject("aggregate", "Book"); 
    ArrayList<BasicDBObject> pipeline = new ArrayList<BasicDBObject>(); 

    BasicDBObject match = new BasicDBObject("$match", new BasicDBObject("indexTokens", java.util.regex.Pattern.compile("^"+titleFilter+"", Pattern.CASE_INSENSITIVE)));
    BasicDBObject unwind = new BasicDBObject("$unwind", "$indexTokens");
    BasicDBObject match2 = new BasicDBObject("$match", new BasicDBObject("indexTokens", java.util.regex.Pattern.compile("^"+titleFilter+"", Pattern.CASE_INSENSITIVE)));
    BasicDBObject groupFilters = new BasicDBObject("_id",null);
    groupFilters.append("indexTokens", new BasicDBObject( "$push", "$indexTokens"));
    BasicDBObject group = new BasicDBObject("$group", groupFilters);

    pipeline.add(match);
    pipeline.add(unwind);
    pipeline.add(match2);
    pipeline.add(group);

    cmdBody.put("pipeline", pipeline); 



    CommandResult res = getCollection().getDB().command(cmdBody);
    System.out.println(res);

Which outputs

{ "result" : [ { "_id" :  null  , "indexTokens" : [ "Paris"]}] , "ok" : 1.0}

This is genius !

Thanks a lot !

0

2 Answers 2

13

You could do this with the 2.2 aggregation framework. Something like this;

db.books.runCommand("aggregate", {
    pipeline: [
        {   // find docs that contain Par*
            $match: { "indexTokens" : { "$regex" : "^Par" , "$options" : "i"}},
        },
        {   // create a doc with a single array elemm for each indexToken entry
            $unwind: "$indexTokens" 
        },
        {   // now produce a list of index tokens
            $group: {
                _id: "$indexTokens",
            },
        },
    ],
})

Or this might be even closer to what you're after if you really want the array without the doc;

db.books.runCommand("aggregate", {
    pipeline: [
        {   // find docs that contain Par*
            $match: { "indexTokens" : { "$regex" : "^Par" , "$options" : "i"}},
        },
        {   // create a doc with a single array elemm for each indexToken entry
            $unwind: "$indexTokens" 
        },
        {   // now throw out any unwind's that DON'T contain Par*
            $match: { "indexTokens": { "$regex": "^Par", "$options": "i" } },
        },
        {   // now produce the list of index tokens
            $group: {
                _id: null,
                indexTokens: { $push: "$indexTokens" },
            },
        },
    ],
})
Sign up to request clarification or add additional context in comments.

2 Comments

You can add this to your original answer as a second solution. This way people won't be confused why there are two answers from you :)
Thanks to both of you it works like a charm. I added an answer to show how I did it in JAVA (I don't have the latest driver so I couldnt use the aggregate() method on DBCollection.
5

Building on the response from cirrus, I recommend doing the $unwind first to avoid the redundant $match. Something like:

db.books.aggregate(
    {$unwind:"$indexTokens"},
    {$match:{indexTokens:/^Par/}},
    {$group:{_id:null,indexTokens:{$push:"$indexTokens"}}
})

How do you do this in Java? You can use the DBCollection.aggregate(...) method of the MongoDB v2.9.0 driver. Each pipeline operator, eg. $unwind or $match, corresponds to a DBObject object.

2 Comments

Actually, I don't think that $match is redundant. The trouble with $unwind is that it'll have to create a massive set of docs in RAM and you want to reduce that set down as early as possible. The first $match ensures that we're ONLY working with docs that even have Par* in the indexTokens before we unwind them. The second $match then cuts that set down to just the ones we want. Remember, you want to get your $match in early to reduce the pipeline volume.
You're right. Weed out the non-matching documents, unwind the arrays, and then match again to weed out the documents that don't match the regular expression.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.