Partitioning MongoDB Data on the Fly

Let’s say you start a project with MongoDB. It is (and probably should be) simple and small. It starts getting some traction and expands. Your single server is now behind a load balancer and your EC2 instance sizes are getting larger and larger.

Someone reminds you to start backing up your data, and you switch to run replica sets. People are using and loving your service, and downtime isn’t an option. Your mongodump snapshots are getting huge, and it’s clear that you’re going to have data scale issues.

Browsing any of the hundreds of articles about MongoDB can teach you a ton of best practices from people who have been to war with growth. But what do you do when your system starts getting too big to change without a day of downtime?

Let’s put some things in perspective:

  • Copying 100gb of data on AWS will take hours
  • Writing 100gb of data with mongodump will take hours
  • Restoring 100gb of data with mongorestore could take many, many hours, depending on the index rebuilding process
  • It’s possible to have inconsistent data when performing mongodump, as records could be added as you’re writing files
  • If you’re on AWS + EBS, you can use the awesome EBS snapshot feature, but to do so, you need to lock the database during the snapshot process, which could take hours

To partition your data with the standard MongoDB toolset, significant downtime is unavoidable. You’ll either need to write a bunch of application logic, or get creative with some third party tools. This is a problem that we’ve hit at Reverb more than once, and are the exact same tools + technique that we used to migrate across datacenters (see From the Cloud and Back).

Now let’s assume that your data growth is such that you need to split up your database into many, smaller databases. This is a very common thing to do when optimizing MongoDB—smaller, discrete MongoDB instances are much easier to manage and scale, especially on cheap cloud servers. So now your app, which looked like this —


will look like this —


We again have a number of options. But let’s assume that downtime isn’t one of them, and you have a reasonably large MongoDB deployment, so the time to move data is going to be in the hours. Luckily there are mechanisms to work with the MongoDB oplog along with some (free!) open-source tools to facilitate this.

From the software point of view, we’re going to assume that the servers are going to be updated with new logic, one after the other, and that they’re behind a reasonable load balancer. We’ll also assume that during the updates, one server can handle the traffic and load, which should be a safe assumption, assuming you’ve been able to update the software without outages in the past.

So during the update period, there could be writes sent to the old service, which is writing to the old, single MongoDB cluster. After updating, there’s a period of time where both servers are writing before the second machine is updated.

For the database, we want to first create a new cluster to take the subset of data from our main cluster. There’s initially no data—this is easy, and well documented.

Now, getting the data ready. We’re going to do this in three steps—all apply just to the collections that we’re MOVING:

  1. Store all write operations to disk
  2. Perform a Mongodump on the old cluster to copy collections
  3. Restore the data with mongorestore
  4. Apply the write operations from disk
  5. Make a real-time replication of write operations from the old database to the new one

Let’s walk through each of these steps.

MongoDB has an oplog for all operations—this is how secondaries are kept in sync with the master server. The standard MongoDB java driver has the ability to read this oplog, and we at Wordnik have written some very simple utilities to work with the oplog.

Writing the oplog operations to disk simply means each write operation will be stored in a file. This is done as follows:



./bin/ com.wordnik.system.mongodb.IncrementalBackupUtil -c database_a.collection_1,database_a.collection_2 -o oplog-files -h

This will connect to your MongoDB cluster at IP address on port 27017 and start writing the operations on collection_1, collection_2 from the database named “database_a” into a folder called “oplog-files”. You’ll see a series of files written to that folder. Note that the entire oplog will be written to this directory, and you probably don’t need this.

To set the time range of files to write, you can simply create a file in the target directory called “last_timestamp.txt”. In that folder, put the unix-timestamp (in seconds) that you want to read from in the target directory:

cat oplog-files/last_timestamp.txt

Note the “|0” at the end

Keep this process running!

Now dump the collections you want. This is easy—just use the standard mongodump command:

mongodump -c collection_1
mongodump -c collection_2

This will write the two collections that we’re interested in into a folder at ./dump/database_a/

Remember that because we’re writing all operations to disk, we’re safe for the delay in writing the oplog.

Now we restore those dump files into our new database. Again standard mongodb administration technique:

mongorestore -d database_a ./dump/database_a

We’re almost there. You can now stop the oplog copying and write the files into the new server. The graceful way to stop the oplog tail program is to create a stop file —

touch stop.txt

— and it’ll flush all data to disk then exit. Note that in the output directory (oplog-files) the last_timestamp.txt file has been updated with the last record it read from the server—don’t delete this file, we’ll need it later. We can now restore those files to the new server:

./bin/ com.wordnik.system.mongodb.ReplayUtil -i oplog-files -h localhost:27017

This will apply—in order—the files that we wrote during the oplog tailing process. Restoring these files will apply all operations to the timestamp that you stopped the file copy process. It can take a little bit to run, depending on how much data you have.

Finally, we’re ready to complete the last step, which is syncing the remaining and future operations from the old master. First, you want to copy the last_timestamp.txt file into the working directory of the oss-tools:

cp oplog-files/last_timestamp.txt .

Then start replicating all operations from that timestamp onwards to the new server:

./bin/ com.wordnik.system.mongodb.ReplicationUtil -h -H localhost:27017 -c database_a.collection_1,database_a.collection_2

Voila! You have now made a exact copy of data from one cluster to another. You should now run some queries on both of the databases to ensure replication is working correctly. You should also verify collection counts between instances.

Finally, we’re ready to update our application. You can do this by simply shutting down the services one at a time and updating them. You should verify that your application logic is writing to the new cluster. Once done, stop the replication utility! You’re now safe to drop the collections that you’ve moved to the new server.

Using replication utilities creates a number of opportunities for interesting deployments. If you look at the source code for the utilities on github:

You’ll see it’s very easy to modify them. So you could, for instance, replicate production data to a development environment, and anonymize email address in the process.

I hope that’s helpful and that it lets you continue to grow your MongoDB deployment without downtime!

2 thoughts on “Partitioning MongoDB Data on the Fly

Leave a Reply

Your email address will not be published. Required fields are marked *