0

I am trying to model a database that needs a very high write throughput, and reasonable read throughput. I have a distributed set of systems that are adding "event" data into the database.

Currently, the id for the event record is a Guid. I have been reading that guids don't tend to create great indexes because their random distribution means that recent data will be scattered in the disk, which can lead to paging problems.

So here is the first assumption I would like to validate: I am assuming that I wan't to choose an _id that creates a right balanced tree, such as something like an autonumber. This would be beneficial because the 2 most recent events would essentially be right next to each other on disk. Is this a correct assumption?

Assuming that (1) is correct, then I am trying to work out the best way to generate such an id. I know Mongo natively supports ObjectId, which is convenient for applications that are ok tying their data to Mongo, but my application isn't such. Since there are multiple systems producing data, simulating an "auto-number" field is a little problematic because mongo doesn't support auto-number at the server side, so the producer would have to assign the id, which is hard if they don't know what the other systems are doing.

In order to solve for this, what I am considering doing is making the _id field a compound key on { localId, producerId } where local id is an autonumber that the producer can generate because producerId will make it unique. ProducerId is something that I can negotiate among producers so that they can come up with unique ids.

So here is my next question: If my goal is to get the most recent data from all producers, then { localId, producerId } should be the preferred key ordering since localId will be right-ist and producerId will be a small cluster, and I would prefer that the 2 most recent events stay local to each other. If I inverted that order, then my reasoning for how the tree would eventually look would be something like the following:

               root
        /        |           \
       p0        p1          p2
       /         |            \
     e0..n      e0..n        e0..n

where p# is the producer Id, and e# is an event. This seems like it would fragment my index into p# clusters of data, and new events wouldn't necessarily be next to each other. My assumption for the preferred ordering should (please verify) look something like this instead:

               root
      /          |          \
     e0          e1         e2
     /            |           \
  p0..n         p0..n        p0..n

which would seem to keep recent events near each other. ( I know that Mongo uses B-trees for indexes, but I am just trying to simplify the visual here ).

The only caveat to { localId, producerId } that I can see is that a common query by the user would be to list the most recent events by producer, which { producerId, localId } would actually handle much better. In order to get this query to work with { localId, producerId }, I am thinking that I will also need to add the producerId as a field to the document, and index that.

To be explicit about what my question here really is, I want to know if I am thinking about this problem correctly, or if there is an obviously better way to approach this.

Thanks

3
  • Just in case you are stating to develop a project, do consider Apache Cassandra. Its pretty amazing for heavy writes. Commented Nov 30, 2012 at 17:52
  • The application is already relatively mature and we already are somewhat committed to Mongo. I am just having to go through an optimization phase because we aren't quite keeping up due to disk thrashing. I will look into Cassandra though, thanks. Commented Nov 30, 2012 at 18:57
  • To be truthful, I am currently facing the similar problem. Even my app is quite mature and I feel that at some point mongo might not work as good as cassandra would have. Though mongo 2.2 is cool Commented Nov 30, 2012 at 19:07

1 Answer 1

1

To answer your question: a compound like this: {a,b} will end in scatter queries if you just query by b and then sort by a. but it will use the index for sorting.

If you use a Document instead of ObjectId, _id will be indexed but not used but it is not a compound index!

Example:

Given this Documents in Collection 'a' and no additional index:

{ "_id" : { "e" : 1, "p" : 1 } }
{ "_id" : { "e" : 1, "p" : 2 } }
{ "_id" : { "e" : 2, "p" : 1 } }
{ "_id" : { "e" : 1, "p" : 3 } }
{ "_id" : { "e" : 2, "p" : 3 } }
{ "_id" : { "e" : 2, "p" : 2 } }
{ "_id" : { "e" : 3, "p" : 1 } }
{ "_id" : { "e" : 3, "p" : 2 } }
{ "_id" : { "e" : 3, "p" : 3 } }

a query like this:

db.a.find({'_id.p' : 2}).sort({'_id.e' : 1}).explain()

will NOT use an index:

{
    "cursor" : "BasicCursor",
    "nscanned" : 9,
    "nscannedObjects" : 9,
    "n" : 3,
    "scanAndOrder" : true,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {   
    }
}

Just because the Documents are indexed.

If you create an index like this:

db.a.ensureIndex({'_id.e' : 1, '_id.p' : 1})

and then query again:

db.a.find({'_id.p' : 2}).sort({'_id.e' : 1}).explain()

{
    "cursor" : "BtreeCursor _id.e_1__id.p_1",
    "nscanned" : 9,
    "nscannedObjects" : 3,
    "n" : 3,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "_id.e" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ],
        "_id.p" : [
            [
                2,
                2
            ]
        ]
    }
}

it will query on the index (nscanned: 9) because of the sort and then fetches the objects : 3, which is better than sorting by _id (nscanned and nscannedObjects would be 9).

Documentation .explain()

So for high write throughput (over 15k writes a sec) you would probably shard. Both Indexes would guarantee uniqueness if option isset. But only a compound shard key will help you for direct queries and no scatter gather.

Using ({'_id.e' : 1, '_id.p' : 1}) as a shard key will route all "_id.e" queries directly but not "_id.p" (without 'e') queries, so these queries will send to every host and end in index lookups there but could be fast aswell (depends ond network etc). If you want to cluster these queries by "p" you have to put '_id.p' as the first part of the compound key like so:

{'_id.p' : 1, '_id.e' : 1}

So all "p" queries are direct queries. But yes, this would scatter recent events across the cluster. So a separate index using the time based key might speed up those scatter queries.

I would generate me some sample data and would play around with it in a setup with two shards on a dev system and use .explain() for choosing the shard key + indexes.

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for your response. I actually came across the mongo indexing the document not the fields problem a bit later after I asked the question. It surprised me, and I wish there was a way for me to tell mongo that I want the _id index to be compound, maybe in the future. Thanks for the suggestions, I will play around with them.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.