EC2: MongoDB Shard Cluster Consistent Backup using EBS snapshots and CPM Part I

Disclaimer: The scripts in this post are given as-is without any warranties. Please consider them as guidelines. Please don’t use them in your production environment until you have thoroughly tested them and made sure the backup and recovery processes work correctly.

In a previous post I demonstrated how to back up a single MongoDB server consistently using EBS snapshots and Cloud Protection Manager (CPM). It’s an easy enough process, but real MongoDB production installations usually consist of more than one server. To use MongoDB with a large amount of data, you typically need to scale horizontally, which means using a shard cluster. A shard cluster is a database distributed between multiple machines, each called a “shard.” Such a cluster typically contains multiple MongoDB shard servers, 3 config servers (which are all identical), and multiple MongoS servers acting as the gateway to the distributed database.

Backing up such a complex environment consistently is a challenge. I followed the instructions in MongoDB’s official documentation. It is currently impossible to back up MongoDB with 100% consistency, with all servers at the exact same point in time, without shutting down the whole cluster. The process shown in the documentation achieves an “approximate” consistency, since all shard servers need to be locked separately. However, if the shard servers are locked within a very short window of time, the end result should be consistent enough.

The backup process:

  1. Stop the balancer (the process which migrates chunks of data between shard servers), then wait until any ongoing migration task completes. This is done on any of the MongoS servers.
  2. Flush & lock all shard servers. Do this within the shortest possible window of time. In production environments, a shard server is often not a single server but a replica set. In that case, a single secondary member of each replica set is chosen and locked.
  3. Stop the MongoDB service on one of the config servers. This prevents any configuration changes from being made on the cluster during the backup process.
  4. After all is locked, the backup process needs to be quick: we need to back up all shard servers (or a secondary member of each shard replica set), and we need to back up one of the config servers (in a production environment there are typically 3, but they are all identical).
  5. Once the snapshots have been started, we need to undo all the locking, i.e. start the config server service, unlock all shard servers (or secondary members), and then enable the balancer again. One huge advantage we have with EBS snapshots is that they don’t need to complete before unlocking the cluster; it’s enough that the snapshots have started. The snapshot mechanism makes sure snapshots contain the exact image of the source volumes at the point in time when the snapshots started.
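The sequence above can be sketched in Python as a minimal illustration. This is not the actual library code: the Server class here merely records the commands it would run, whereas a real implementation would execute them over SSH on each host, and the exact shell commands (sh.stopBalancer(), db.fsyncLock(), the service name) depend on your MongoDB version and OS.

```python
class Server:
    """Stand-in for a cluster member; records commands instead of
    executing them over SSH."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def run(self, command):
        # A real implementation would SSH to the host and execute this.
        self.log.append(command)

def backup_cluster(mongos, config_server, shards, take_snapshots):
    # 1. Stop the balancer via any mongos and wait for migrations to finish.
    mongos.run('mongo --eval "sh.stopBalancer()"')
    # 2. Flush & lock every shard (or a secondary of each replica set).
    for shard in shards:
        shard.run('mongo --eval "db.fsyncLock()"')
    # 3. Freeze cluster metadata by stopping one config server.
    config_server.run('sudo service mongod stop')
    # 4. Start EBS snapshots of all locked servers; they only need to
    #    *start* before we unlock, not complete.
    take_snapshots()
    # 5. Undo everything in reverse order.
    config_server.run('sudo service mongod start')
    for shard in shards:
        shard.run('mongo --eval "db.fsyncUnlock()"')
    mongos.run('mongo --eval "sh.startBalancer()"')
```

The point of the reverse-order teardown is that the balancer is only re-enabled after every shard is unlocked and the config server is back up, so no chunk migration can start against a half-unlocked cluster.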

So I wrote a Python library with a class called MongoCluster. All you need to do is declare an object of this type with a path to a configuration file describing the cluster members, then call the “lock” and “unlock” methods of this object, and it will do the rest. On top of that I wrote two very short executable Python scripts using the library: one locks the cluster, the other unlocks it.

Now, these scripts were written with EC2 instances and EBS snapshots in mind, and they are specifically tailored to work well with Cloud Protection Manager (CPM); that said, they can be used in other environments. In the second part of this post I will show how to use these scripts to define a backup policy that backs up a complete MongoDB shard cluster in a few minutes. That way, one can easily perform this complicated task, even if there are replica members in different AWS regions.

Now, I had to make some design decisions while writing these scripts, and I wanted to explain them.

Design Points:

  1. Since this script is meant for CPM, I assume it runs on a machine that is not part of the cluster (i.e. the CPM Server) and connects remotely to all cluster members.
  2. All output from the scripts goes to stderr; CPM catches that output and stores it, allowing the user to monitor the process and debug any issues.
  3. I chose to connect to remote machines using SSH and run the mongo commands on the remote machines. I could have used a MongoDB client (there’s an excellent library in Python), but I chose SSH for security reasons: MongoDB authentication credentials would pass over the wire, and there is no way I’ll do that in clear text. MongoDB supports connections over SSL, but to my understanding not all MongoDB installations have SSL enabled, and I’m not sure how widespread this practice is, so I chose to avoid relying on it. Furthermore, in terms of ports and EC2/VPC security groups, it’s a good bet that SSH will be open for all incoming addresses, while mongo ports may be restricted. For an SSH client I chose the Python library “paramiko”, which needs to be present when using these scripts.
  4. In order to keep the locking time minimal, and in order for it to stay minimal even if the MongoDB cluster has scaled to many shards, I lock the shard servers in parallel using multithreading.
  5. The scripts need to be robust. If locking any of the shard servers fails, the script needs to roll back all other locks and declare the whole cluster locking as failed. Operations need to have a timeout: for instance, if a migration process of the balancer would take too long, the script needs to abort and fail the locking operation (the timeout value is configurable). CPM will create the snapshots anyway, but will mark the backup as “not entirely successful.” It can later, according to the backup policy, retry the whole backup.
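Design points 4 and 5 can be sketched with Python’s standard concurrent.futures module. This is an illustration, not the actual library source: lock_fn and unlock_fn are hypothetical stand-ins for the real per-shard SSH operations, and the timeout here is applied per future rather than as one overall deadline.

```python
from concurrent.futures import ThreadPoolExecutor

def lock_all(shards, lock_fn, unlock_fn, timeout_secs=60):
    """Lock all shards in parallel; on any failure or timeout, roll back
    the shards that did lock and report the whole operation as failed."""
    locked = []
    success = True
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = {pool.submit(lock_fn, s): s for s in shards}
        for future, shard in futures.items():
            try:
                # Per-future timeout; a production version would track a
                # single overall deadline across all shards.
                if future.result(timeout=timeout_secs):
                    locked.append(shard)
                else:
                    success = False
            except Exception:  # lock raised, or timed out
                success = False
    if not success:
        # Roll back: unlock everything we managed to lock.
        for shard in locked:
            unlock_fn(shard)
    return success
```

Running the locks concurrently keeps the total lock window close to that of the slowest single shard, instead of growing linearly with the number of shards.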

Usage and configuration file:

The configuration file is basically a file in “ini” format, which describes the MongoDB shard cluster. All the options are in the README file with the source code, but in short there is a section defining how to interact with the MongoS server, a section for the Config Server (any one of the 3), and a section for each shard server (or replica set secondary member). Here is an example:

[MONGOS]
address=mongos.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
mongo_port=27017
mongo_user=user1
mongo_password=pass123
timeout_secs=60

[CONFIGSERVER]
address=mongoconfig.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
timeout_secs=60

[SHARDSERVER_1]
address=mongoshard1.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
mongo_port=27018
mongo_user=user1
mongo_password=pass123
timeout_secs=60

[SHARDSERVER_2]
address=mongoshard2.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
mongo_port=27018
mongo_user=user1
mongo_password=pass123
timeout_secs=60
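A file like the example above can be read with Python’s standard configparser module. The following sketch is illustrative, not the actual library source (the original scripts target Python 2, where the module is named ConfigParser); only the section names match the example.

```python
from configparser import ConfigParser

def load_cluster_config(path):
    """Parse the cluster description: one MONGOS section, one CONFIGSERVER
    section, and any number of SHARDSERVER_* sections."""
    parser = ConfigParser()
    with open(path) as f:
        parser.read_file(f)
    mongos = dict(parser["MONGOS"])
    config_server = dict(parser["CONFIGSERVER"])
    # Collect every shard section, however many the cluster has.
    shards = [dict(parser[name]) for name in parser.sections()
              if name.startswith("SHARDSERVER")]
    return mongos, config_server, shards
```

Keeping one section per shard means adding a shard to the backup is just adding a [SHARDSERVER_N] section; no code changes are needed.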

We need to define how to connect to each server using SSH, and then how to connect locally to MongoDB, except in the case of the config server, where the script just starts and stops the MongoDB service. The scripts using the library are very short and simple. Here is the locking script:

#!/usr/bin/python

import sys
from mongo_db_cluster_locker import *

locker = MongoCluster('/cpmdata/scripts/mongo_db_cluster_locker.cfg')
success = locker.lock()
if success:
    retval = 0
else:
    retval = -1
sys.exit(retval)

All I needed to do was import the locker library, declare a MongoCluster object (with its config file), and then call “lock()”. The script returns 0 if locking succeeded and -1 if it failed. The unlocking script is just as simple:

#!/usr/bin/python

import sys
from mongo_db_cluster_locker import *

if sys.argv[1] == '0':
    sys.stderr.write('After script does nothing since "before" script failed\n')
    sys.exit(0)

locker = MongoCluster('/cpmdata/scripts/mongo_db_cluster_locker.cfg')
success = locker.unlock()
if success:
    retval = 0
else:
    retval = -1
sys.exit(retval)

The only added stage here is that the script checks the first command-line argument, which states whether the first (locking) script succeeded (1 = succeeded, 0 = failed). If the first one failed, the second one doesn’t attempt to unlock anything. CPM automatically passes this argument when running backup scripts.

In the second part of the post, I’ll show how to easily define a backup policy in CPM that uses these scripts. You can download the scripts in a zip archive here.

For any questions, feedback, corrections or suggestions please email me at info@n2ws.com.
