
EC2: MongoDB Shard Cluster Consistent Backup using EBS snapshots and CPM Part II


Disclaimer: The scripts in this post are given as-is, without any warranties. Please consider them as guidelines, and don’t use them in your production environment until you have thoroughly tested them and made sure the backup and recovery processes work correctly.

In the first part of this post, I presented a set of Python scripts that lock and unlock a complete MongoDB shard cluster. Using CPM, configuring backup for such a cluster takes only a few minutes. Basically, we need a backup policy that includes all the shards of the cluster. Assuming the shards are actually replica sets, it needs to back up a secondary member of each set. The backup policy also needs to include one config server; typically, production environments have three, but since they are all identical, only one is needed in the policy. The mongos server is needed for the locking process, but since it is stateless, it doesn’t need to be included in the backup unless you want to include it.
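To make the overall flow concrete, here is a minimal, illustrative Python sketch of the locking order such a backup relies on: stop the balancer on the mongos, then fsync-lock the config server and one secondary per shard before the snapshots are taken. This is not the Part I script; it assumes pymongo is available, uses placeholder hostnames and the ports from the configuration file shown later in this post, and ignores authentication.

```python
# Illustrative only -- the real locking is done by the Part I scripts over SSH.
# Assumes pymongo, no authentication, and placeholder hostnames/ports.
from pymongo import MongoClient

MONGOS = ("mongos.example.com", 27017)
CONFIG_SERVER = ("mongoconfig.example.com", 27019)
SHARD_SECONDARIES = [("mongoshard1.example.com", 27018),
                     ("mongoshard2.example.com", 27018)]

def lock_cluster():
    # 1. Stop the balancer so no chunk migration runs while the shards are locked.
    #    (The real scripts also wait for any in-progress migration to finish;
    #    newer MongoDB versions also offer the balancerStop admin command.)
    mongos = MongoClient(*MONGOS)
    mongos.config.settings.update_one({"_id": "balancer"},
                                      {"$set": {"stopped": True}}, upsert=True)
    # 2. Flush and lock the config server and one secondary of each shard.
    for host, port in [CONFIG_SERVER] + SHARD_SECONDARIES:
        client = MongoClient(host, port, directConnection=True)  # talk to this member directly
        client.admin.command("fsync", lock=True)   # sends {fsync: 1, lock: true}

def unlock_cluster():
    # Reverse order: unlock the servers, then let the balancer run again.
    for host, port in SHARD_SECONDARIES + [CONFIG_SERVER]:
        client = MongoClient(host, port, directConnection=True)
        client.admin.command("fsyncUnlock")        # MongoDB 3.2+; older versions unlock differently
    MongoClient(*MONGOS).config.settings.update_one(
        {"_id": "balancer"}, {"$set": {"stopped": False}}, upsert=True)
```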

First stage: Define the Policy

mongo cluster CPM policy

I defined a new policy named “mongodb_cluster,” and associated it with a schedule. I can configure any schedule I need.

CPM mongodb cluster backup targets

Then, when selecting backup targets, I chose a config server and the two shard servers in the policy. Since the shard locking happens in parallel, locking a larger number of shards should take approximately the same time.
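As an illustration of why the number of shards barely affects the lock time, the per-shard lock calls can be issued concurrently, for example with a thread pool. This is a hedged sketch, not the actual Part I implementation, and the hostnames are placeholders.

```python
# Hedged sketch: lock all shard secondaries concurrently, so the total lock
# time stays roughly constant as shards are added. Hostnames are illustrative.
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

SHARD_SECONDARIES = [("mongoshard1.example.com", 27018),
                     ("mongoshard2.example.com", 27018)]

def fsync_lock(host, port):
    client = MongoClient(host, port, directConnection=True)
    client.admin.command("fsync", lock=True)
    return host

with ThreadPoolExecutor(max_workers=len(SHARD_SECONDARIES)) as pool:
    for host in pool.map(lambda hp: fsync_lock(*hp), SHARD_SECONDARIES):
        print("locked", host)
```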

CPM mongodb cluster policy options

The last stage is to change the options of the policy (by clicking “More Options”):

Please note: I enabled backup scripts and increased the script timeout, since the locking (especially waiting for the balancer to stop) may take some time. I chose 4 minutes; an even larger value may be needed. I also chose to collect the scripts’ output, which gives very good visibility into what the scripts are doing.

Setting up the scripts

The scripts can be found here. To log in to the CPM server, I need an SSH client with an option to copy files over. I use a Windows system with PuTTY and WinSCP; other tools are available as well. Instructions on how to connect to EC2 instances using PuTTY can be found here. I logged in as user “cpmuser” and copied the scripts to the folder “/cpmdata/scripts.” The names of the executables need to match “before_<policy name>” and “after_<policy name>”; the last script, which is empty and does nothing (but needs to be present), is called “complete_<policy name>.”

After copying the scripts (including the Python module mongo_db_cluster_locker.py), I needed the Python SSH library “paramiko.” It is recommended to keep it locally in the scripts folder and to avoid actually installing anything on the CPM server, since at some stage, for an upgrade or another reason, you will need to terminate the instance and launch a new one. All the scripts and code stored in “/cpmdata/scripts” reside on the CPM data volume and will remain there even if the instance is terminated. So, for “paramiko,” I copied the library locally into my scripts folder. I did the following:

  > cd /tmp
> wget https://github.com/paramiko/paramiko/archive/master.zip
(If the link changes, go to the “paramiko” site [[link]] and find the new one)
> sudo apt-get install unzip
(This is indeed installing something, but I only need this utility for now…)
> unzip master.zip
> mv paramiko-master/paramiko/ /cpmdata/scripts
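For reference, here is a minimal, hedged sketch of how a script in this folder can use the locally copied “paramiko” to run a command on one of the MongoDB hosts over SSH. The hostname, key file, and command are placeholders; the actual logic lives in mongo_db_cluster_locker.py and may differ.

```python
#!/usr/bin/env python
# Minimal sketch: use the paramiko copy under /cpmdata/scripts to run a
# command on a remote MongoDB host. Hostname, key file, and command are
# placeholders, not values from the actual Part I scripts.
import sys
sys.path.insert(0, "/cpmdata/scripts")   # pick up the local paramiko copy
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("mongoshard1.example.com", username="ubuntu",
               key_filename="/cpmdata/scripts/mongosshkey.pem")
stdin, stdout, stderr = client.exec_command(
    "mongo --port 27018 --eval 'db.serverStatus().ok'")
print(stdout.read())
client.close()
```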

Now, I wanted to make sure the executable scripts have executable permissions for the “cpmuser” user:

  > chmod u+x /cpmdata/scripts/*_mongo_cluster

Then I copied over all the needed SSH private keys (“.pem” files) to allow the scripts to connect to all the remote machines. It is important to keep permissions minimal on those key files (read-only permission for the “cpmuser” user only). The last thing needed was to edit the script configuration file (mongo_db_cluster_locker.cfg) correctly. Here is an example:

[MONGOS]
address=mongos.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
mongo_port = 27017
timeout_secs = 60
[CONFIGSERVER]
address=mongoconfig.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
mongo_port = 27019
timeout_secs = 60
[SHARDSERVER_1]
address=mongoshard1.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
mongo_port = 27018
timeout_secs = 60
[SHARDSERVER_2]
address=mongoshard2.example.com
ssh_user=ubuntu
ssh_key_file=/cpmdata/scripts/mongosshkey.pem
mongo_port = 27018
timeout_secs = 60
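To show how such a file can be consumed, here is a hedged sketch that reads it with Python’s standard library ConfigParser; the actual parsing in mongo_db_cluster_locker.py may well differ.

```python
# Hedged sketch: read mongo_db_cluster_locker.cfg with the standard library.
# The real mongo_db_cluster_locker.py may parse it differently.
try:
    from configparser import ConfigParser                       # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser   # Python 2

cfg = ConfigParser()
cfg.read("/cpmdata/scripts/mongo_db_cluster_locker.cfg")

for section in cfg.sections():
    host = cfg.get(section, "address")
    user = cfg.get(section, "ssh_user")
    key = cfg.get(section, "ssh_key_file")
    port = cfg.getint(section, "mongo_port")
    timeout = cfg.getint(section, "timeout_secs")
    print("%s: ssh %s@%s (key %s), mongo port %d, timeout %ds"
          % (section, user, host, key, port, timeout))
```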

When editing the configuration file, please make sure all paths to key files, addresses, ports, and credentials are correct.
To test the scripts, just run them from the command line:

  > ./before_mongo_cluster
> ./after_mongo_cluster 1

When testing, please make sure not to leave the cluster locked…

Now all is set, and CPM will take it from here. It started using the scripts and performing backups as scheduled:

CPM mongodb cluster backup monitor

Clicking on “Open” in the Log column of the backup monitor screen displays the backup process step by step:

CPM mongodb cluster backup log

Clicking on the “Recover” button displays the recovery panel screen:

CPM mongodb cluster recovery panel

All the instances involved in the backup are shown here. It is possible to recover them, and also to view the output of the backup scripts by clicking on the “External Data” links:

CPM mongodb cluster before output

All steps of the MongoDB shard cluster locking scripts are displayed here. If an error occurred, this output can give insight into what the issue was.

Recovery:

CPM allows recovery of complete instances, and it is possible to recover an entire cluster in a matter of minutes. If DR was defined on the backup policy, it is even possible to perform that recovery in another AWS region.
Two things need to be taken into account:

  1. EBS snapshots of the shard servers are taken while the server is locked. When recovered, MongoDB detects the old lock file and refuses to start, assuming the server previously crashed. Of course, this is not a real issue in this case. It can be seen easily in the MongoDB log:

     > grep "old lock file" /var/log/mongodb/mongodb.log
     old lock file: /data/mongod.lock.  probably means unclean shutdown,
     Sat Sep 21 20:10:52 [initandlisten] exception in initAndListen: 12596 old lock file, terminating

     This is easy to fix, but requires performing this manual operation on all shards (a sketch that automates it across shards appears after this list):

     > sudo rm /data/mongod.lock
     > sudo service mongodb start

  2. If the original addresses of the shard servers are preserved, the cluster will work without any changes. That can be done by using Elastic IP addresses and assigning them to the recovered shard servers (assuming the old cluster is dead). This address association is done in the AWS Management Console; CPM will not take care of that. When using new addresses, it is required to change the configuration in the config servers for the cluster to start working again.
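As mentioned in item 1 above, here is a hedged Python sketch of how that lock-file cleanup could be automated over SSH with paramiko. The hostnames, key file, lock-file path, and service name are placeholders taken from the examples earlier in this post; adjust them to your environment.

```python
# Hedged sketch: remove the stale mongod lock file and restart MongoDB on each
# recovered shard over SSH. Hostnames, key file, lock-file path, and service
# name are placeholders and may differ in your environment.
import paramiko

RECOVERED_SHARDS = ["mongoshard1.example.com", "mongoshard2.example.com"]
KEY_FILE = "/cpmdata/scripts/mongosshkey.pem"

for host in RECOVERED_SHARDS:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="ubuntu", key_filename=KEY_FILE)
    for cmd in ("sudo rm -f /data/mongod.lock", "sudo service mongodb start"):
        stdin, stdout, stderr = client.exec_command(cmd)
        print(host, cmd, "->", stdout.channel.recv_exit_status())
    client.close()
```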

Conclusion:

Defining consistent backups of a MongoDB shard cluster using CPM and backup scripts is a short and easy process.
For any comments, feedback, corrections, or suggestions, you can contact us at info@n2ws.com