
I have the following MongoDB document structure:

{ 
    "_id" : "519817e508a16b447c00020e", 
    "keyword" : "Just an example query", 
    "rankings" : 
    {
        "results" :
        {
            "1" : { "domain" : "example1.com", "href" : "http://www.example1.com/"},
            "2" : { "domain" : "example2.com", "href" : "http://www.example2.com/"},
            "3" : { "domain" : "example3.com", "href" : "http://www.example3.com/"},
            "4" : { "domain" : "example4.com", "href" : "http://www.example4.com/"},
            "5" : { "domain" : "example5.com", "href" : "http://www.example5.com/"},
            ...
            ...
            "99" : { "domain" : "example99.com", "href" : "http://www.example99.com/"},
            "100" : {"domain" : "example100.com", "href" : "http://www.example100.com/"}
        }, 
        "plus":"many", 
        "other":"not", 
        "interesting" : "stuff", 
        "for": "this question"
    }
}

In a previous question, I asked how to index the text so that I could search for the keyword and domain using for example:

db.ranking.find({ $text: { $search: "\"example9.com\" \"Just an example query\""}})  

The awesome answer by John Petrone was:

db.ranking.ensureIndex(
{
    "keyword": "text",
    "rankings.results.1.domain" : "text",
    "rankings.results.2.domain" : "text",
    ...
    ...
    "rankings.results.99.domain" : "text",
    "rankings.results.100.domain" : "text"
})

However, while that works great with 10 results, I run into an "Index key pattern too large" error (code 67) in the Mongo shell when I try to index 100 results.

So the big question is:

How (the hell) can I resolve that "index key pattern too large" error?


EDIT: 18/08/2014 The document structure clarified

{ 
    "_id" : "519817e508a16b447c00020e", #From Mongodb
    "keyword" : "Just an example query", 
    "date" : "2014-03-28",
    "rankings" :
    {
            "1" : { "domain" : "example1.com", "href" : "http://www.example1.com/", "plus" : "stuff1"},
            ...
            "100" : {"domain" : "example100.com", "href" : "http://www.example100.com/", "plus" : "stuff100"}
    }, 
    "plus":"many", 
    "other":"not", 
    "interesting" : "stuff", 
    "for": "this question"
}
  • It occurred to me when I saw the original question that this would quickly become a problem. Is there no way you can change the structure? Commented Aug 16, 2014 at 6:52
  • Well, that's JSON received from an external API. But if I changed the structure, what would you recommend, and how could I set up the index? Commented Aug 16, 2014 at 10:13
  • Do you want to reference both domain and href in your queries? With equal weight? Presumably the _id comes from each set of searches, so you could have multiple documents, each with a different _id and an array of results, with href and domain inside? I just want to be clear before offering an answer. Commented Aug 17, 2014 at 10:59
  • What is the use case? I feel like a different data model is the best way to proceed here. Commented Aug 18, 2014 at 15:52
  • Exact matches as in your examples? Or keyword/text search matches? Text search on domain and href fields or just the keyword field? Commented Aug 18, 2014 at 18:30

2 Answers


The problem with your suggested structure:

{
 "keyword" : "Just an example query", 
 "rankings" :
    [{"rank" : 1, "domain" : "example1.com", "href" : "example1.com"},
     ...
     {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"}
 ]
}

is that, although you can now do

db.ranking.ensureIndex({"rankings.href":"text", "rankings.domain":"text"}) 

and then run queries like:

db.ranking.find({$text:{$search:"example1"}});

this will return the whole document, including the entire array, whenever any array element matches.

You might want to consider referencing, so that each rankings result is a separate document and the keywords and other metadata are referenced, to avoid repeating lots of information.

So, you have a keyword/metadata document like:

{_id:1, "keyword":"example query", "querydate": date, "other stuff":"other meta data"},
{_id:2, "keyword":"example query 2", "querydate": date, "other stuff":"other meta data 2"}

and then a results document like:

{keyword_id:1, {"rank" : 1, "domain" : "example1.com", "href" : "example1.com"}},
...
{keyword_id:1, {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"}},
{keyword_id:2, {"rank" : 1, "domain" : "example1.com", "href" : "example1.com"}},
...
{keyword_id:2, {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"}}

where keyword_id links back to (references) the keyword/metadata collection. Obviously, in practice, the _ids will look like "_id" : "519817e508a16b447c00020e"; the short values here are just for readability. You could now index on keyword_id, domain and href, either together or separately, depending on your query types. You will not get the "index key pattern too large" error, and a query will return a single matching document rather than a whole array.
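A minimal sketch of that split in plain JavaScript, assuming a flat layout for each result document. The names `splitRankingDocument`, `keywordDoc` and `resultDocs` are hypothetical and do not come from the original data. The code only reshapes an in-memory object and does not talk to MongoDB:

```javascript
// Hypothetical helper: splits one embedded ranking document into a
// keyword/metadata document plus one flat document per result.
function splitRankingDocument(doc) {
  // Everything except the rankings array is treated as metadata.
  const { rankings, ...metadata } = doc;
  const keywordDoc = { ...metadata };
  const resultDocs = rankings.map((result) => ({
    keyword_id: doc._id, // reference back to the keyword document
    ...result,
  }));
  return { keywordDoc, resultDocs };
}

// Sample input modelled on the array structure discussed above.
const embedded = {
  _id: 1,
  keyword: "example query",
  rankings: [
    { rank: 1, domain: "example1.com", href: "example1.com" },
    { rank: 99, domain: "example99.com", href: "example99.com" },
  ],
};

const { keywordDoc, resultDocs } = splitRankingDocument(embedded);
console.log(keywordDoc);
console.log(resultDocs);
```

Each element of `resultDocs` could then be inserted into its own collection and indexed on keyword_id, domain and href as described.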

I am not entirely clear on where you need fuzzy/regex-style searches, or whether you will be searching the metadata or just href and domain, but I think this structure is a cleaner way to start thinking about indexing, without hitting index limits as before. It will also allow you to combine finds on normal indexes with text indexes, depending on your query pattern.
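To get a feel for why the original per-rank key pattern hits a limit, here is a rough size estimate in plain JavaScript. It assumes the error is triggered by the BSON size of the key pattern document and uses 2048 bytes as the threshold. That figure is an assumption and is not stated anywhere in this thread. `buildKeyPattern` and `estimateBsonSize` are hypothetical helper names:

```javascript
// Builds the key pattern from the question for the first `n` results.
function buildKeyPattern(n) {
  const pattern = { keyword: "text" };
  for (let i = 1; i <= n; i++) {
    pattern[`rankings.results.${i}.domain`] = "text";
  }
  return pattern;
}

// Estimates the BSON size of a document whose values are all strings:
// 4-byte length prefix + elements + 1-byte terminator, where each
// element is 1 type byte + key + NUL + 4-byte string length + value + NUL.
function estimateBsonSize(pattern) {
  const bytes = (s) => new TextEncoder().encode(s).length;
  let size = 4 + 1;
  for (const [key, value] of Object.entries(pattern)) {
    size += 1 + bytes(key) + 1 + 4 + bytes(value) + 1;
  }
  return size;
}

// Assumption: threshold chosen for illustration, not taken from the thread.
const ASSUMED_LIMIT = 2048;

for (const n of [10, 100]) {
  const size = estimateBsonSize(buildKeyPattern(n));
  console.log(`${n} results: ~${size} bytes, over assumed limit: ${size > ASSUMED_LIMIT}`);
}
```

By contrast, a pattern such as {"rankings.domain" : "text"} on an array of results stays the same size no matter how many results each document holds.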

You might find the answer "MongoDB relationships: embed or reference?" useful when considering your document structure.


4 Comments

Thanks, it's a very interesting answer (a pity I can't vote yet). The reference article you posted is great, and I actually had not considered splitting. But why not an even flatter structure for the results documents, like: {keyword_id:1, "rank" : 1, "domain" : "example1-1.com", "href" : "example1-1.com"} ... {keyword_id:1, "rank" : 99, "domain" : "example1-99.com", "href" : "example1-99.com"}, {keyword_id:2, "rank" : 1, "domain" : "example2-1.com", "href" : "example2-1.com"} ... {keyword_id:2, "rank" : 99, "domain" : "example2-99.com", "href" : "example2-99.com"}
And this is a very interesting article too seanhess.github.io/2012/02/01/mongodb_relational.html
Yes, you could go for a totally flat structure like that. I was thinking you might want to keep meta data separately around each set of requests to the api, but the flat structure will certainly work and be easy to index and query.
Very good article. More or less what I was hinting at with the comment about querying arrays

So, here is my solution: I decided to stick with the embedded document, with one very simple modification: replacing the dictionary keyed by rank with an array whose elements each carry their own rank:

{ 
  "_id" : "519817e508a16b447c00020e", #From Mongodb
  "keyword" : "Just an example query", 
  "date" : "2014-03-28",
  "rankings" :
  [
    { 
      "domain" : "example1.com", "href" : "http://www.example1.com/", "plus" : "stuff1", "rank" : 1
    },
    ...
    {
      "domain" : "example100.com", "href" : "http://www.example100.com/", "plus" : "stuff100", "rank" : 100
    }
  ],
  "plus":"many", 
  "more":"uninteresting", 
  "stuff" : "for", 
  "this": "question"
}
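The conversion from the rank-keyed dictionary to an array could be scripted roughly as follows. This is a sketch in plain JavaScript, assuming the API response has already been parsed into an object. `rankingsToArray` and `apiDocument` are hypothetical names, and the sample data is abbreviated:

```javascript
// Hypothetical helper: turns {"1": {...}, ..., "100": {...}} into
// [{..., rank: 1}, ..., {..., rank: 100}], ordered by rank.
function rankingsToArray(rankings) {
  return Object.entries(rankings)
    .map(([rank, result]) => ({ ...result, rank: Number(rank) }))
    .sort((a, b) => a.rank - b.rank);
}

// Abbreviated sample shaped like the document received from the API.
const apiDocument = {
  keyword: "Just an example query",
  date: "2014-03-28",
  rankings: {
    "1": { domain: "example1.com", href: "http://www.example1.com/", plus: "stuff1" },
    "2": { domain: "example2.com", href: "http://www.example2.com/", plus: "stuff2" },
    "100": { domain: "example100.com", href: "http://www.example100.com/", plus: "stuff100" },
  },
};

// Copy the document, replacing only the rankings field.
const converted = { ...apiDocument, rankings: rankingsToArray(apiDocument.rankings) };
console.log(JSON.stringify(converted, null, 2));
```

The converted object is what would be stored in the collection instead of the raw API response.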

Then, I can select an entire document using for example:

> db.ranking.find({"keyword":"how are you doing", "rank_date" : "2014-08-27"})

Or a single result by using the positional projection operator ($), which is just awesome :-D

> db.collection.find({ "rank_date" : "2014-04-09", "rankings.href": "http://www.example100.com/" }, { "rankings.$": 1 })

  [
    { 
      "domain" : "example100.com", "href" : "http://www.example100.com/", "plus" : "stuff100", "rank" : 100
    }
  ]
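For readers unfamiliar with the positional projection, here is a rough plain-JavaScript emulation of what "rankings.$": 1 yields for a single document: only the first array element that matches the query condition, alongside the _id. `projectFirstMatch` is a hypothetical name. This only illustrates the semantics and is not how MongoDB implements it:

```javascript
// Hypothetical emulation of { "rankings.$": 1 } for one document:
// keep _id and only the first rankings element whose href matches.
function projectFirstMatch(doc, href) {
  const match = doc.rankings.find((r) => r.href === href);
  if (match === undefined) {
    return null; // the document would not have matched the query at all
  }
  return { _id: doc._id, rankings: [match] };
}

// Abbreviated sample document in the array-based layout.
const doc = {
  _id: "519817e508a16b447c00020e",
  rank_date: "2014-04-09",
  rankings: [
    { domain: "example1.com", href: "http://www.example1.com/", rank: 1 },
    { domain: "example100.com", href: "http://www.example100.com/", rank: 100 },
  ],
};

const projected = projectFirstMatch(doc, "http://www.example100.com/");
console.log(projected.rankings[0].rank);
```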

And even get one single url rank directly:

> db.collection.find({"rank_date" : "2014-04-09", "rankings.href": "http://www.example5.com/"}, { "rankings.$": 1 })[0]['rankings'][0]['rank']
5

And finally, I'm also creating an index based on the url:

> db.collection.ensureIndex( {"rankings.href" : "text"} )

With the index, I can search for a single URL, a partial URL, a subdomain or the entire domain, so that's just great:

> db.collection.find({ $text: { $search: "example5.com"}})

And that's it really! Thanks a lot for everyone's help, especially @JohnBarça :-D
