Mongodb - Multiple text index: Index key pattern too large error code 67

Question

I have the following MongoDB document structure:

{ 
    "_id" : "519817e508a16b447c00020e", 
    "keyword" : "Just an example query", 
    "rankings" : 
    {
        "results":
        {
            "1" : { "domain" : "example1.com", "href" : "http://www.example1.com/"},
            "2" : { "domain" : "example2.com", "href" : "http://www.example2.com/"},
            "3" : { "domain" : "example3.com", "href" : "http://www.example3.com/"},
            "4" : { "domain" : "example4.com", "href" : "http://www.example4.com/"},
            "5" : { "domain" : "example5.com", "href" : "http://www.example5.com/"},
            ...
            ...
            "99" : { "domain" : "example99.com", "href" : "http://www.example99.com/"},
            "100" : {"domain" : "example100.com", "href" : "http://www.example100.com/"}
        }, 
        "plus":"many", 
        "other":"not", 
        "interesting" : "stuff", 
        "for": "this question"
    }
}

In a previous question, I asked how to index the text so that I could search for the keyword and domain using for example:

db.ranking.find({ $text: { $search: "\"example9.com\" \"Just an example query\""}})

The awesome answer by John Petrone was:

db.ranking.ensureIndex(
{
    "keyword": "text",
    "rankings.results.1.domain" : "text",
    "rankings.results.2.domain" : "text",
    ...
    ...
    "rankings.results.99.domain" : "text",
    "rankings.results.100.domain" : "text"
})

However, while that works fine when I have 10 results, I run into an "Index key pattern too large" error (code 67) from the mongo shell when I try to index 100 results.

So the big question is:

How (the hell) can I resolve that "index key pattern too large" error?

EDIT: 18/08/2014 The document structure clarified

{ 
    "_id" : "519817e508a16b447c00020e", // from MongoDB
    "keyword" : "Just an example query", 
    "date" : "2014-03-28",
    "rankings" :
    {
            "1" : { "domain" : "example1.com", "href" : "http://www.example1.com/", "plus" : "stuff1"},
            ...
            "100" : {"domain" : "example100.com", "href" : "http://www.example100.com/", "plus" : "stuff100"}
    }, 
    "plus":"many", 
    "other":"not", 
    "interesting" : "stuff", 
    "for": "this question"
}

It occurred to me when I saw the original question that this would quickly become a problem. Is there no way you can change the structure?
– John Powell, Commented Aug 16, 2014 at 6:52
Well, that's JSON received from an external API. But if I changed the structure, what would you recommend, and how could I set up the index?
– Antoine Brunel, Commented Aug 16, 2014 at 10:13
Do you want to reference both domain and href in your queries? With equal weight? Presumably the _id comes from each set of searches, so you could have multiple documents, each with a different _id and an array of results, with href and domain inside? I just want to be clear before offering an answer.
– John Powell, Commented Aug 17, 2014 at 10:59
What is the use case? I feel like a different data model is the best way to proceed here.
– wdberkeley, Commented Aug 18, 2014 at 15:52
Exact matches as in your examples? Or keyword/text search matches? Text search on domain and href fields or just the keyword field?
– wdberkeley, Commented Aug 18, 2014 at 18:30

Community · Accepted Answer · 2017-05-23 11:46:11Z

1

The problem with your suggested structure:

{
 "keyword" : "Just an example query", 
 "rankings" :
    [{"rank" : 1, "domain" : "example1.com", "href" : "example1.com"},
     ...
     {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"}
 ]
}

is that, although you can now do

db.ranking.ensureIndex({"rankings.href":"text", "rankings.domain":"text"})

and then run queries like:

db.ranking.find({$text:{$search:"example1"}});

this will return the whole document, including the entire rankings array, whenever any single array element matches.

You might want to consider referencing instead: store each rankings result as a separate document that references the keyword and other metadata, which avoids repeating lots of information.

So, you have a keyword/metadata document like:

{_id:1, "keyword":"example query", "querydate": date, "other stuff":"other meta data"},
{_id:2, "keyword":"example query 2", "querydate": date, "other stuff":"other meta data 2"}

and then a results document like:

{keyword_id:1, {"rank" : 1, "domain" : "example1.com", "href" : "example1.com"}},
...
{keyword_id:1, {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"}},
{keyword_id:2, {"rank" : 1, "domain" : "example1.com", "href" : "example1.com"}},
...
{keyword_id:2, {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"}}

where keyword_id links back to (references) the keyword/metadata collection. In practice, the _ids will look like "_id" : "519817e508a16b447c00020e"; small integers are used here just for readability. You can now index on keyword_id, domain and href, either together or separately, depending on your query types. You will not get the "index key pattern too large" error, and a query will return only the single matching document rather than a whole array.
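
As a rough illustration of the split described above, here is a minimal sketch in plain JavaScript (no database connection). The helper name splitForReferencing is hypothetical, and the result documents use the flat shape raised in the comments below rather than anything MongoDB prescribes:

```javascript
// Sketch only: splitForReferencing is a hypothetical helper, not a MongoDB API.
// It splits one embedded document into a keyword/metadata document plus one
// flat result document per ranking, each pointing back via keyword_id.
function splitForReferencing(doc, keywordId) {
  const { rankings, ...metadata } = doc;
  const keywordDoc = { ...metadata, _id: keywordId };
  const resultDocs = rankings.map((entry) => ({ keyword_id: keywordId, ...entry }));
  return { keywordDoc, resultDocs };
}

// Sample input in the array shape shown at the top of this answer.
const embedded = {
  keyword: "Just an example query",
  rankings: [
    { rank: 1, domain: "example1.com", href: "example1.com" },
    { rank: 2, domain: "example2.com", href: "example2.com" },
  ],
};

const { keywordDoc, resultDocs } = splitForReferencing(embedded, 1);
console.log(keywordDoc);
console.log(resultDocs);
```

The two outputs would then go into separate collections, with the results collection indexed on keyword_id, domain and href as suggested above.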

I am not entirely clear on where you need fuzzy/regex-style searches, or whether you will be searching metadata or just href and domain, but I think this structure is a cleaner starting point for indexing, without hitting the index limits you ran into before. It will also allow you to combine finds on normal indexes with text indexes, depending on your query pattern.

You might find the answer to "MongoDB relationships: embed or reference?" useful when considering your document structure.

edited May 23, 2017 at 11:46

answered Aug 17, 2014 at 10:59

John Powell

4 Comments

Antoine Brunel Over a year ago

Thanks, it's a very interesting answer (a pity I can't vote yet). The reference article you posted is great and I actually did not consider splitting. But why not an even flatter structure for results document, like:

{keyword_id:1, "rank" : 1, "domain" : "example1-1.com", "href" : "example1-1.com"} ... {keyword_id:1, "rank" : 99, "domain" : "example1-99.com", "href" : "example1-99.com"}, {keyword_id:2, "rank" : 1, "domain" : "example2-1.com", "href" : "example2-1.com"} ... {keyword_id:2, "rank" : 99, "domain" : "example2-99.com", "href" : "example2-99.com"}

Antoine Brunel Over a year ago

And this is a very interesting article too seanhess.github.io/2012/02/01/mongodb_relational.html

John Powell Over a year ago

Yes, you could go for a totally flat structure like that. I was thinking you might want to keep meta data separately around each set of requests to the api, but the flat structure will certainly work and be easy to index and query.

John Powell Over a year ago

Very good article. More or less what I was hinting at with the comment about querying arrays

Antoine Brunel · Accepted Answer · 2014-10-21 09:08:07Z

So, here is my solution: I decided to stick with the embedded document, with one very simple modification: replacing the dictionary keyed by rank with an array whose elements each carry their rank, and that's it:

{ 
  "_id" : "519817e508a16b447c00020e", // from MongoDB
  "keyword" : "Just an example query", 
  "date" : "2014-03-28",
  "rankings" :
  [
    { 
      "domain" : "example1.com", "href" : "http://www.example1.com/", "plus" : "stuff1", "rank" : 1
    },
    ...
    {
      "domain" : "example100.com", "href" : "http://www.example100.com/", "plus" : "stuff100", "rank" : 100
    }
  ],
  "plus":"many", 
  "more":"uninteresting", 
  "stuff" : "for", 
  "this": "question"
}
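
For anyone wanting to do the same conversion, this is roughly how the rank-keyed dictionary can be turned into the array form. It is only a sketch in plain JavaScript; rankingsToArray is a hypothetical helper name, and the sample document is shortened to two entries:

```javascript
// Sketch only: rankingsToArray is a hypothetical helper name.
// It replaces the rank-keyed "rankings" object with an array in which
// every element carries its own numeric "rank" field.
function rankingsToArray(doc) {
  const rankings = Object.entries(doc.rankings)
    .map(([rank, entry]) => ({ ...entry, rank: Number(rank) }))
    .sort((a, b) => a.rank - b.rank); // keep results ordered by rank
  return { ...doc, rankings };
}

// Shortened sample in the dictionary shape received from the external API.
const original = {
  keyword: "Just an example query",
  date: "2014-03-28",
  rankings: {
    "1": { domain: "example1.com", href: "http://www.example1.com/", plus: "stuff1" },
    "2": { domain: "example2.com", href: "http://www.example2.com/", plus: "stuff2" },
  },
};

const converted = rankingsToArray(original);
console.log(JSON.stringify(converted, null, 2));
```

The helper returns a new document and leaves the input untouched, so the original API payload can still be kept around if needed.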

Then, I can select an entire document using for example:

> db.ranking.find({"keyword":"how are you doing", "rank_date" : "2014-08-27"})

Or a single result by using the positional $ projection, which is just awesome :-D

> db.collection.find({ "rank_date" : "2014-04-09", "rankings.href": "http://www.example100.com/" }, { "rankings.$": 1 })

  [
    { 
      "domain" : "example100.com", "href" : "http://www.example100.com/", "plus" : "stuff100", "rank" : 100
    }
  ]

And even get one single url rank directly:

> db.collection.find({"rank_date" : "2014-04-09", "rankings.href": "http://www.example5.com/"}, { "rankings.$": 1 })[0]['rankings'][0]['rank']
5
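
If the whole document has already been fetched, the same lookup can be done client-side. The sketch below is plain JavaScript that mirrors what the "rankings.$" projection returns (the first array element matching the condition); rankForHref is a hypothetical helper name and the sample document is shortened:

```javascript
// Sketch only: rankForHref is a hypothetical helper name.
// It mirrors the positional projection on an in-memory document by taking
// the first rankings element whose href matches, then reading its rank.
function rankForHref(doc, href) {
  const match = doc.rankings.find((entry) => entry.href === href);
  return match ? match.rank : null; // null when the URL is not ranked
}

// Shortened sample document in the array shape used in this answer.
const rankingDoc = {
  keyword: "Just an example query",
  rankings: [
    { domain: "example4.com", href: "http://www.example4.com/", rank: 4 },
    { domain: "example5.com", href: "http://www.example5.com/", rank: 5 },
  ],
};

console.log(rankForHref(rankingDoc, "http://www.example5.com/"));
console.log(rankForHref(rankingDoc, "http://www.missing.com/"));
```

Doing the projection on the server is still preferable when only one result is needed, since it avoids transferring the whole array.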

And finally, I'm also creating an index based on the url:

> db.collection.ensureIndex( {"rankings.href" : "text"} )

With the index, I can search for a single URL, a partial URL, a subdomain or the entire domain, so that's just great:

> db.collection.find({ $text: { $search: "example5.com"}})

And that's it really! Thanks a lot for everyone's help, especially @JohnBarça :-D
