I am trying to model a database that needs a very high write throughput, and reasonable read throughput. I have a distributed set of systems that are adding "event" data into the database.
Currently, the id for the event record is a GUID. I have been reading that GUIDs don't tend to make great index keys because their random distribution means that recent data will be scattered across the disk, which can lead to paging problems.
So here is the first assumption I would like to validate: I am assuming that I want to choose an _id that creates a right-balanced tree, such as an auto-number. This would be beneficial because the two most recent events would essentially be right next to each other on disk. Is this a correct assumption?
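To make assumption (1) concrete, here is a minimal Python sketch. It is not MongoDB: a sorted list stands in for the index, and the hypothetical helper `insert_positions` records where each new key lands. The random 128-bit keys are built from a seeded generator so that the run is repeatable; they only imitate the random distribution of a GUID.

```python
import bisect
import random
import uuid


def insert_positions(keys):
    """Insert keys one at a time into a sorted list (a stand-in for an index).

    Returns each key's relative landing position at insertion time:
    0.0 is the far left of the index, 1.0 is the far right.
    """
    index = []
    positions = []
    for key in keys:
        pos = bisect.bisect_left(index, key)
        # By convention the very first key counts as landing on the right edge.
        positions.append(pos / len(index) if index else 1.0)
        index.insert(pos, key)
    return positions


rng = random.Random(42)  # seeded so the sketch is repeatable

# Auto-number style keys: strictly increasing.
sequential_positions = insert_positions(range(1000))

# GUID-like keys: 128 random bits each (an imitation, not uuid4 itself).
guid_keys = [uuid.UUID(int=rng.getrandbits(128)) for _ in range(1000)]
guid_positions = insert_positions(guid_keys)

print("sequential: every insert at right edge =",
      all(p == 1.0 for p in sequential_positions))
print("guid-like: mean landing position =",
      round(sum(guid_positions) / len(guid_positions), 2))
```

In this model the increasing keys always land at the right edge, while the random keys land all over the index, which is the scattering I am worried about.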
Assuming that (1) is correct, I am trying to work out the best way to generate such an id. I know MongoDB natively supports ObjectId, which is convenient for applications that are OK with tying their data to MongoDB, but my application isn't one of them. Since there are multiple systems producing data, simulating an auto-number field is problematic: MongoDB doesn't support auto-number on the server side, so the producer would have to assign the id, which is hard if it doesn't know what the other systems are doing.
To solve this, I am considering making the _id field a compound key on { localId, producerId }, where localId is an auto-number that each producer can generate on its own, because producerId makes the combination unique. producerId is something that I can negotiate among producers so that each one ends up with a unique value.
So here is my next question: if my goal is to get the most recent data from all producers, then { localId, producerId } should be the preferred key ordering, since localId will keep growing to the right and producerId will form a small cluster under it, and I would prefer that the two most recent events stay local to each other. If I inverted that order, my reasoning is that the tree would eventually look something like the following:
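A minimal sketch of the id scheme I have in mind, in Python. The names `EventId` and `Producer` are hypothetical, the producer ids are assumed to have been negotiated already, and the counter lives only in memory (a real producer would have to persist it across restarts).

```python
import itertools
from typing import NamedTuple


class EventId(NamedTuple):
    """Compound key in { localId, producerId } order."""
    local_id: int
    producer_id: int


class Producer:
    """Hands out ids without coordinating with any other producer."""

    def __init__(self, producer_id):
        self.producer_id = producer_id      # assumed unique per producer
        self._counter = itertools.count()   # local auto-number, starts at 0

    def next_id(self):
        return EventId(next(self._counter), self.producer_id)


p0, p1 = Producer(0), Producer(1)
ids = [p0.next_id(), p1.next_id(), p0.next_id(), p1.next_id()]

# Both producers reuse the same localId values, but the pairs never collide.
print(ids)
print("all unique:", len(set(ids)) == len(ids))
```

Because `EventId` is a tuple, it sorts by localId first and producerId second, which is the ordering I am asking about below.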
              root
           /   |   \
        p0     p1     p2
       /       |        \
   e0..n     e0..n     e0..n
where p# is the producer id, and e# is an event. This seems like it would fragment my index into p# clusters of data, and new events wouldn't necessarily be next to each other. My assumption is that the preferred ordering would (please verify) look something like this instead:
              root
           /   |   \
        e0     e1     e2
       /       |        \
   p0..n     p0..n     p0..n
which would seem to keep recent events near each other. (I know that MongoDB uses B-trees for indexes, but I am just trying to simplify the visual here.)
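The two pictures can be compared with a small Python model. This is only a sketch under simplifying assumptions: a sorted list stands in for the B-tree, there are three producers, and they emit events in lockstep (round-robin). The helper `distance_from_right_edge` is hypothetical.

```python
import bisect


def distance_from_right_edge(keys):
    """Insert keys in arrival order into a sorted list.

    Returns, for each key, how many existing entries sit to its right
    at insertion time (0 means it was appended at the right edge).
    """
    index = []
    distances = []
    for key in keys:
        pos = bisect.bisect_left(index, key)
        distances.append(len(index) - pos)
        index.insert(pos, key)
    return distances


# Arrival order: every producer emits event 0, then event 1, and so on.
arrivals = [(local_id, producer_id)
            for local_id in range(100)
            for producer_id in range(3)]

# Key ordering { localId, producerId }
local_first = distance_from_right_edge(arrivals)

# Key ordering { producerId, localId }
producer_first = distance_from_right_edge(
    [(producer_id, local_id) for local_id, producer_id in arrivals])

print("localId first, worst distance from right edge:", max(local_first))
print("producerId first, last three inserts:", producer_first[-3:])
```

In this model, with localId first every insert lands at the right edge. With producerId first, each producer appends to the end of its own range, so there are as many insertion points as producers. If the producers do not advance in lockstep, the localId-first inserts would land near the right edge rather than exactly on it.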
The only caveat to { localId, producerId } that I can see is that a common query by the user would be to list the most recent events by producer, which { producerId, localId } would actually handle much better. In order to get this query to work with { localId, producerId }, I am thinking that I will also need to add the producerId as a field to the document, and index that.
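The trade-off for the per-producer query can be sketched the same way in Python. Again, sorted lists stand in for the two key orderings, the data is made up (three producers, 100 events each), and both helper functions are hypothetical. The sketch does not model the extra producerId index; it only shows why the query is awkward without one.

```python
import bisect

events = [(local_id, producer_id)
          for local_id in range(100)
          for producer_id in range(3)]

by_local = sorted(events)                                  # { localId, producerId }
by_producer = sorted((p, l) for l, p in events)            # { producerId, localId }


def recent_range(index, producer_id, n):
    """{ producerId, localId } ordering: one contiguous range read."""
    lo = bisect.bisect_left(index, (producer_id,))
    hi = bisect.bisect_left(index, (producer_id + 1,))
    return index[max(lo, hi - n):hi][::-1]                 # newest first


def recent_scan(index, producer_id, n):
    """{ localId, producerId } ordering: walk from the right and filter."""
    found, visited = [], 0
    for local_id, p in reversed(index):
        visited += 1
        if p == producer_id:
            found.append((local_id, p))
            if len(found) == n:
                break
    return found, visited


print(recent_range(by_producer, 1, 3))
print(recent_scan(by_local, 1, 3))
```

With producerId first, the three newest events for a producer are adjacent entries. With localId first, the walk has to step over the other producers' entries as well, and the number of skipped entries grows with the number of producers.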
To be explicit about what my question here really is, I want to know if I am thinking about this problem correctly, or if there is an obviously better way to approach this.
Thanks