MongoDB: data versioning with search

Question

Related to "Ways to implement data versioning in MongoDB" and "structure of documents for versioning of a time series on mongodb".

What data structure should I adopt for versioning when I also need to be able to handle queries?

Suppose I have 8500 documents of the form

{ _id: '12345-11',
  noFTEs: 5
}

Each month I get details of a change to noFTEs in about 30 docs. I want to store the new value along with the previous one(s), together with a date.

That would seem to result in:

{ _id: '12345-11',
  noFTEs: {
     '2015-10-28T00:00:00+01:00': 5,
     '2015-1-8T00:00:00+01:00': 3
  }
}

But I also want to be able to search on the most recent data (e.g. noFTEs > 4, where the element should be considered as 5, not 3). At that stage all I know is that I want to use the most recent data, and I will not know the key. So an alternative would be an array:

{ _id: '12345-11',
  noFTEs: [
     {date: '2015-10-28T00:00:00+01:00', val: 5},
     {date: '2015-1-8T00:00:00+01:00', val: 3}
  ]
}
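If the array is always kept newest-first, as in the example above, the most recent value sits at index 0 and can be matched with dot notation. The following is only a sketch under that assumption: the helper names and the `staff` collection name are hypothetical, and the functions just build plain filter and update documents to pass to `find` and `updateOne`.

```javascript
// Assumption: noFTEs is always stored newest-first, so index 0 is the current value.

// Filter for "the most recent noFTEs value is greater than `threshold`".
function latestGreaterThan(threshold) {
  return { "noFTEs.0.val": { $gt: threshold } };
}

// Update document that prepends a new entry, preserving the newest-first order.
function prependVersion(date, val) {
  return { $push: { noFTEs: { $each: [{ date: date, val: val }], $position: 0 } } };
}

// Intended use in the shell (collection name "staff" is hypothetical):
//   db.staff.find(latestGreaterThan(4))
//   db.staff.updateOne({ _id: "12345-11" },
//                      prependVersion("2015-11-30T00:00:00+01:00", 6))
```

The query only stays correct as long as every write goes through something like `prependVersion`; a plain `$push` would append to the end and break the ordering assumption.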

Another alternative - as suggested by @thomasbormans in the comments below - would be

{ _id: '12345-11',
  versions: [
     {noFTEs: 5, lastModified: '2015-10-28T00:00:00+01:00', other data...},
     {noFTEs: 3, lastModified: '2015-1-8T00:00:00+01:00', other...}
  ]
}

I'd really appreciate some insight into what I need to consider before jumping all the way in. I fear I am heading toward a query that is a pretty high workload for Mongo. (In practice there are 3 other fields that can be combined for searching, and one of these is also likely to see changes over time.)

I recently implemented versioning by adding a versions array. When the document is updated, the unedited document is copied and pushed inside the versions array. And because my documents had a lastModified field, I am able to get all versions with the date that they were edited.
– Thomas Bormans, Commented Nov 14, 2015 at 11:35
Are you able to search over the most recent entries of your data?
– Simon H, Commented Nov 14, 2015 at 11:36
I only query on the current document but you can $unwind the array and perform an aggregate function.
– Thomas Bormans, Commented Nov 14, 2015 at 11:38
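The `$unwind` approach from the last comment could look roughly like the pipeline below, applied to the `versions` layout from the question. This is a sketch, not a tested production query: the function name and the `staff` collection are hypothetical, and it assumes `lastModified` values sort correctly (BSON dates, or consistently zero-padded ISO strings with the same offset).

```javascript
// Builds an aggregation pipeline that keeps only the newest entry of
// `versions` for each document and then filters on that entry.
function latestVersionPipeline(threshold) {
  return [
    // One output document per element of the versions array.
    { $unwind: "$versions" },
    // Newest version first.
    { $sort: { "versions.lastModified": -1 } },
    // Keep the first (newest) version seen for each original document.
    { $group: { _id: "$_id", latest: { $first: "$versions" } } },
    // Search on the most recent data only.
    { $match: { "latest.noFTEs": { $gt: threshold } } },
  ];
}

// Intended use in the shell (collection name "staff" is hypothetical):
//   db.staff.aggregate(latestVersionPipeline(4))
```

Note that this pipeline unwinds and sorts every version of every document before it can filter, which is the kind of workload the question is worried about.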

David Rissato Cruz · Accepted Answer · 2016-07-28 16:04:01Z

When you model a NoSQL database, there are some things you need to keep in mind.

The first is the size of each document. If you use arrays in your documents, be sure they won't push a document past the 16 MB size limit.

Second, you must model your database so that things are easy to retrieve. Some "denormalization" is acceptable in favor of speed and ease of use for your application.

So if you need to know the current noFTEs value, and you need to keep a history only for audit purposes, you could go with 2 collections:

collection["current"] = [
    {
        _id: '12345-11',
        noFTEs: 5, 
        lastModified: '2015-10-28T00:00:00+01:00'
    }
]

collection["history"] = [
    {
        _id: ...an object id...
        source_id: '12345-11',
        noFTEs: 5, 
        lastModified: '2015-10-28T00:00:00+01:00'
    },
    {
        _id: ...an object id...
        source_id: '12345-11',
        noFTEs: 3, 
        lastModified: '2015-1-8T00:00:00+01:00'
    }
]

Doing it this way, you keep your most frequently accessed records smaller (I suppose the current version is accessed more often). This makes Mongo more likely to keep the "current" collection in its memory cache. And documents will be retrieved faster from disk, because they are smaller.
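To illustrate the flow, here is a small in-memory model of the two collections. It is only a sketch: `createStore`, `recordChange` and `findCurrent` are hypothetical names, and a `Map` and an array stand in for the real "current" and "history" collections.

```javascript
// In-memory stand-ins for the two collections (hypothetical; a real
// application would use MongoDB collections instead).
function createStore() {
  return { current: new Map(), history: [] };
}

// Record a change: overwrite the current document and append a history entry.
function recordChange(store, id, noFTEs, lastModified) {
  store.current.set(id, { _id: id, noFTEs: noFTEs, lastModified: lastModified });
  store.history.push({ source_id: id, noFTEs: noFTEs, lastModified: lastModified });
}

// Search on current data only; history is never touched.
function findCurrent(store, predicate) {
  return Array.from(store.current.values()).filter(predicate);
}

const store = createStore();
recordChange(store, "12345-11", 3, "2015-01-08T00:00:00+01:00");
recordChange(store, "12345-11", 5, "2015-10-28T00:00:00+01:00");

// Only the latest value (5) is considered, so the document matches noFTEs > 4.
const hits = findCurrent(store, function (doc) { return doc.noFTEs > 4; });
console.log(hits.length, store.history.length);
```

Against real collections, the search would be an ordinary query on the "current" collection, along the lines of `find({ noFTEs: { $gt: 4 } })`, with no need to know dates or array positions.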

I find this design to be the best in terms of memory optimisation. But the decision depends directly on how you will use your data.

EDIT: I changed my original response to use a separate insert for each history entry. In my original answer, I kept the history entries close to your original solution in order to focus on the denormalization topic. However, keeping history in an array is a poor design decision, so I decided to make this answer more complete.

There are several reasons to keep separate documents in the history collection instead of an array:

1) Whenever a document grows (for example, when you insert more data into it), Mongo may need to move it to an empty part of your disk in order to accommodate the larger document (this is how the MMAPv1 storage engine behaves). You end up with storage gaps that make your collections larger.

2) Whenever you insert a new document, Mongo tries to predict how big it can become based on previous inserts/updates. If your history documents are similar in size, the padding factor will be close to optimal. However, when you maintain growing arrays, this prediction won't be good and Mongo will waste space on padding.

3) In the future, you will probably want to shrink your history collection if it grows too large. Usually, we define a policy for history retention (for example, 5 years), and you can back up and prune data older than that. If you have kept a separate document for each history entry, this operation will be much easier.

I can find other reasons, but I believe those 3 are enough to make the point.
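As a sketch of the retention policy in point 3, the helpers below build a filter for history entries older than a cutoff. The function names are hypothetical, and the filter assumes `lastModified` is stored as a BSON date rather than a string.

```javascript
// Returns the date `years` before `now` (computed in UTC).
function retentionCutoff(now, years) {
  const cutoff = new Date(now.getTime());
  cutoff.setUTCFullYear(cutoff.getUTCFullYear() - years);
  return cutoff;
}

// Filter matching history entries older than the retention period.
// Assumption: lastModified is stored as a date, not as a string.
function expiredHistoryFilter(now, years) {
  return { lastModified: { $lt: retentionCutoff(now, years) } };
}

// Intended use in the shell, after backing up the matching documents:
//   db.history.deleteMany(expiredHistoryFilter(new Date(), 5))
```

With one document per history entry, pruning is a single delete by date. With an array per document, each document would have to be updated to pull out its old entries.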

Philipp · Accepted Answer · 2015-11-14 12:01:38Z


To add versioning without compromising usability and speed of access for the most recent data, consider creating two collections: one with the most recent documents and one to archive the old versions of the documents when they get changed.

You can use currentVersionCollection.findAndModify to update a document while also receiving the previous (or new, depending on parameters) version of said document in one command. You then just need to remove the _id of the returned document, add a timestamp and/or revision number (when you don't have these already) and insert it into the archive collection.

By storing each old version in its own document, you also avoid document growth and prevent documents from exceeding the 16 MB document limit when they get changed a lot.
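The "remove the _id, add a timestamp, insert into the archive" step could be sketched as below. The helper name and the `archivedAt` field are hypothetical; `source_id` is borrowed from the other answer as the back-reference to the current document, and the collection names in the usage comment are assumptions.

```javascript
// Turns the previous version of a document (as returned by findAndModify)
// into an entry for the archive collection. The input is not modified.
function toArchiveEntry(previousDoc, archivedAt) {
  const entry = Object.assign({}, previousDoc);
  delete entry._id;                   // let the archive assign its own _id
  entry.source_id = previousDoc._id;  // which current document this belonged to
  entry.archivedAt = archivedAt;      // hypothetical timestamp field
  return entry;
}

// Intended use in the shell (collection names are hypothetical):
//   var prev = db.current.findAndModify({
//     query: { _id: "12345-11" },
//     update: { $set: { noFTEs: 6 } }
//   });
//   if (prev) db.archive.insertOne(toArchiveEntry(prev, new Date()));
```

Note that the update and the archive insert are two separate operations, so a failure between them would leave the old version unarchived.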

edited Nov 14, 2015 at 12:01

answered Nov 14, 2015 at 11:50


2 Comments

Simon H Over a year ago

Don't I have to store all docs related to the same _id in one document for when I do want to look at the time series? (That won't be a problem in this case.)

Philipp Over a year ago

@SimonH When you want to look at the history of a document, I would query for all the past versions of that document and then sort them by timestamp/revision number. Some field in the past version documents which says which current document they belong to would be a requirement for this, of course.
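A possible shape for that history lookup is sketched below. The field names `source_id` and `archivedAt` are assumptions (the answer only requires some back-reference field and a timestamp or revision number), and `historyQuery` is a hypothetical helper.

```javascript
// Builds the filter and sort order for reading one document's past versions,
// oldest first. Field names are assumptions, not prescribed by the answer.
function historyQuery(sourceId) {
  return {
    filter: { source_id: sourceId },
    sort: { archivedAt: 1 },
  };
}

// Intended use in the shell (collection name "archive" is hypothetical):
//   var q = historyQuery("12345-11");
//   db.archive.find(q.filter).sort(q.sort)
```

An index on the back-reference field followed by the timestamp would likely help this query, though that depends on the actual data volume.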
