10. Disaster Recovery (DR)

Performing cross-region disaster recovery. Configuring CPM disaster recovery, how disaster recovery works, and DR planning. Disaster recovery monitoring and troubleshooting.


10 – Disaster Recovery (DR)

CPM’s DR (Disaster Recovery) solution allows you to recover your data and servers in case of a disaster. DR helps you recover your data regardless of why your system was taken out of service.

 

What does that mean in a cloud environment like EC2? Every EC2 region is divided into Availability Zones (AZs), which use separate infrastructure (power, networking, etc.). Because CPM uses EBS snapshots, you can recover your EC2 servers to other AZs in the same region. CPM’s DR builds on AWS’s ability to copy EBS snapshots between regions, which extends recovery of instances and EBS volumes to other regions. You may need this ability if a whole region suffers a full-scale outage. It is not limited to DR, however: you can also use it to migrate instances and data between regions. If you use CPM to take RDS snapshots, those snapshots are also copied and are available in other regions.

  • DynamoDB Tables: DR for DynamoDB tables is currently not supported by AWS.
  • Redshift Clusters: Currently CPM does not support DR of Redshift clusters. If you enable DR on a policy containing Redshift clusters, they will be ignored at the DR stage. You can have AWS copy Redshift snapshots between regions automatically by enabling cross-region snapshots in the Amazon Redshift console.

 

10.1 – Configuring DR

After defining a policy, click the DR button under the Configure column in the Policies tab of the main screen.


Figure 10‑1

 

In the DR Options screen, configure the following:

  • Enable DR – Whether DR is enabled for this policy. By default, DR is disabled.
  • Perform DR Every – Frequency of performing DR in terms of backups. The default is to copy snapshots of all backups to other regions. To reduce costs, you may want to reduce the frequency. See section 10.4 below for considerations in planning DR.
  • Target Regions – Which region or regions you want to copy the snapshots of the policy to.
  • DR Timeout (hours) – How long CPM waits for the DR process on the policy to complete. DR copies data between regions over a WAN (Wide Area Network), which can take a long time. CPM waits for the copy processes to make sure they complete successfully. If the entire DR process does not complete within the timeout, CPM assumes the process is hanging and declares it as failed. The default is 24 hours, which should be enough time for a few 1 TiB EBS volumes to copy. Depending on the volume of data, however, you may want to increase or decrease the timeout.

 

 

10.2 – About the DR Process

Things to know about the DR process:

  • CPM’s DR process runs in the background.
  • It starts when the backup process is finished. At that point, CPM determines whether DR should run and, if so, kicks off the process.
  • Because copying snapshots can take time, CPM waits until all copy operations have completed successfully before declaring the DR status as Completed.
  • Unlike the backup process, which allows only one backup of a policy to run at a time, DR processes are completely independent. For example, if you have an hourly backup that runs DR each time and DR takes more than an hour to complete, the DR of the next backup will begin before the first one has completed.
  • Although CPM can handle many DR processes in parallel, AWS limits the number of copy operations that can run in parallel in any given region to avoid congestion. See section 10.4.2.
  • CPM will keep all information of the original snapshots and the copied snapshots and will know how to recover instances and volumes in all relevant regions.
  • The automatic retention process that deletes old snapshots will also clean up the old snapshots in other regions. When a regular backup is outside the retention window and its snapshots are deleted, so are the DR snapshots that were copied to other regions.

 

 

10.3 – DR and mixed-region policies

CPM supports backup objects from multiple regions in one policy. In most cases this is probably not best practice, but sometimes it is useful. When you choose a target region for DR, DR copies to that region all the backup objects in the policy that are not already there. For example, if you back up an instance in Virginia and an instance in N. California, and you choose N. California as a target region, only the snapshots of the Virginia instance will be copied to N. California. You can therefore implement a mutual DR policy: choose both Virginia and N. California as target regions, and the Virginia instance will be copied to N. California and vice versa. This can come in handy if there is a problem or an outage in one of these regions, because you can always recover the instance in the other region.
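
The rule described above (copy every object that is not already in the target region) can be sketched in a few lines of Python. This is only an illustration of the logic, not CPM code: the function name, the object names, and the use of AWS region codes are assumptions made for the example.

```python
from typing import Dict, List


def plan_dr_copies(objects: Dict[str, str], target_regions: List[str]) -> Dict[str, List[str]]:
    """Map each target region to the backup objects that would be copied there.

    `objects` maps a (hypothetical) backup object name to its source region.
    An object is not copied to a target region it already lives in.
    """
    plan: Dict[str, List[str]] = {}
    for target in target_regions:
        plan[target] = sorted(
            name for name, source in objects.items() if source != target
        )
    return plan


# Hypothetical policy with one instance in Virginia and one in N. California.
policy_objects = {
    "instance-virginia": "us-east-1",
    "instance-california": "us-west-1",
}

# Single target region: only the Virginia instance is copied.
print(plan_dr_copies(policy_objects, ["us-west-1"]))

# Mutual DR: each instance is copied to the other region.
print(plan_dr_copies(policy_objects, ["us-east-1", "us-west-1"]))
```

With both regions selected as targets, each region ends up holding a copy of the other region's instance, which is the mutual DR arrangement described in this section.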

 

10.4 – Planning your DR Solution

10.4.1 – Time and Financial Considerations

There are some fundamental differences between local backup and DR to other regions. It is important to understand the differences and their implications when planning your DR solution. The differences between storing EBS snapshots locally and copying them to other regions are:

  • Copying between regions transfers data over a WAN, so it is much slower than moving data locally. A data transfer from the U.S. to Australia or Japan will take considerably more time than a local copy.
  • AWS charges you for the data transfer between regions. This can affect your AWS costs, and the prices differ depending on the source region of the transfer. For example, as of March 2013, transferring data out of U.S. regions cost 0.02 USD/GiB, and the price climbed to 0.16 USD/GiB out of the South America region.

 

As an extreme example: You have an instance with four 1 TiB EBS volumes attached to it. The volumes are 75% full, and the data on all the volumes changes by an average of 3% a day. This brings the total size of the daily snapshots to around 100 GiB. Locally, you take four backups a day. Because snapshots are incremental, taking one backup a day or four makes little difference in cost and time. The same is true for copying snapshots, since that operation is incremental as well. Now you want a DR solution for this instance. Copying every backup will transfer around 100 GiB a day. You need to calculate the price of transferring 100 GiB a day and of storing that data in the remote region in addition to the local region.
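
The arithmetic behind this example can be worked through with a short Python sketch. It uses the March 2013 transfer prices quoted above purely as sample inputs; current AWS prices will differ, and the cost of storing the copied snapshots in the remote region is not included. The function names and the 30-day month are assumptions made for the illustration.

```python
GIB_PER_TIB = 1024


def daily_changed_gib(volumes: int, size_tib: float, fill_ratio: float, daily_change: float) -> float:
    """Approximate amount of data (GiB) that changes per day across all volumes."""
    return volumes * size_tib * GIB_PER_TIB * fill_ratio * daily_change


def monthly_transfer_cost(gib_per_day: float, usd_per_gib: float, days: int = 30) -> float:
    """Cross-region transfer cost for a month, excluding remote storage costs."""
    return gib_per_day * usd_per_gib * days


# Four 1 TiB volumes, 75% full, 3% daily change: roughly 92 GiB,
# which the text rounds to "around 100 GiB".
changed = daily_changed_gib(volumes=4, size_tib=1, fill_ratio=0.75, daily_change=0.03)
print(f"Changed data per day: {changed:.2f} GiB")

# Monthly transfer cost for 100 GiB a day at the two sample prices.
for region, price in [("U.S. regions", 0.02), ("South America", 0.16)]:
    cost = monthly_transfer_cost(100, price)
    print(f"{region}: about {cost:.0f} USD per month")
```

At the sample prices, the same 100 GiB a day costs eight times as much to transfer out of the most expensive source region as out of the cheapest one, which is why the source region matters when planning DR.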

 

10.4.2 – Timing your DR processes

You should define your recovery objectives for both local backup and DR according to your business needs. However, you do have to take costs and feasibility into consideration. In many cases it is acceptable to say: for local recovery I want frequent backups, four times a day, but for DR it is enough to have a daily copy of my data. Or maybe it is enough to run DR every two days. There are two ways to define such a policy using CPM:

  • In the definition of your policy, select the frequency in Perform DR Every. If the policy runs four times a day, configure DR to run once every four backups. The DR status of all the other backups will be Skipped.
  • Or, define a special policy for the DR process. If you have a sqlserver1 policy, define another one and name it something like sqlserver1_dr. Define all targets and options the same as the first policy, but choose a schedule relevant for DR. Then define DR for the second policy. Locally it will not add any significant cost since it is all incremental, but you will get DR only once a day.
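
The effect of the first option can be sketched as follows. This is a simplified model, not CPM's actual scheduling logic: in particular, it assumes that the first backup of every group is the one that runs DR, which the text does not specify, and the function name and the "DR runs" label are invented for the illustration.

```python
from typing import List


def dr_statuses(backup_count: int, dr_every: int) -> List[str]:
    """Return a DR label for each consecutive backup of a policy.

    Assumption: DR runs on the first backup of every group of `dr_every`
    backups; all other backups get the Skipped status.
    """
    if dr_every < 1:
        raise ValueError("dr_every must be at least 1")
    return [
        "DR runs" if index % dr_every == 0 else "Skipped"
        for index in range(backup_count)
    ]


# A policy that backs up four times a day, with DR once every four backups,
# shown over two days (eight backups).
for number, status in enumerate(dr_statuses(backup_count=8, dr_every=4), start=1):
    print(f"backup {number}: {status}")
```

With these settings, only one backup a day is copied to the target regions, which matches the goal of a daily DR copy alongside more frequent local backups.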

 

10.4.3 – Performing DR on the CPM Server (The cpmdata Policy)

To perform DR recovery, you need a CPM server that is up and running. If the original server is alive, you can use it to perform recovery across regions. You should also prepare for the case where the CPM server itself is down, so you may want to copy your CPM database across regions as well. Generally, it is a good idea to place your CPM server in a different region than your other production data. CPM has no problem working across regions. Note also that even if you only need to recover from a malfunction in a single AZ of your region, the CPM server will not be available if it happens to be in that zone.

To make it easy and safe to back up the CPM server database, there is a special policy named cpmdata. Although CPM supports managing multiple AWS accounts, the only account that can back up the CPM server is the one that owns it, i.e. the account used to create it. Define a new policy and name it cpmdata (case insensitive), and CPM will automatically create a policy that backs up the CPM data volume in a consistent manner.

 

Not all options are available with the cpmdata policy, but you can control:

  • Scheduling
  • Number of generations, and
  • DR settings

When setting these options, remember that at the time of recovery you will need the most recent copy of this database. Older copies may point to snapshots that no longer exist and will not include newer snapshots. Even if you want to recover an instance from a week ago, you should always use the latest backup of the cpmdata policy.

 

 

10.5 – DR Recovery

DR recovery is similar to regular recovery with a few differences, as shown in Figure 10‑2:

  • When you click the Recover button for a backup that includes DR (DR is in Completed state), you get the same Recovery Panel screen with the addition of a drop-down list.


Figure 10‑2

  • The DR Region default is Origin, which recovers all the objects from the original backup. This is the same recovery you would perform for a policy with no DR.
  • When you choose one of the target regions, CPM displays the objects available in that region and recovers them there.

Recommendation: N2WS strongly recommends that you perform recovery drills occasionally to be sure your recovery scenario works. It is not recommended to try it for the first time when your servers are down. Each policy on the policy screen shows the last time recovery was performed on it. Use the last recovery time data to track recovery drills.

 

10.5.1 – DR Instance Recovery

Volume recovery is the same in any region. With instance recovery there are a few things that need considering. An EC2 instance is typically related to other EC2 objects:

  • Image ID (AMI)
  • Key Pair
  • Security Groups
  • Kernel ID
  • RAMDisk ID

 

These objects exist in the region of the original instance, but they do not mean anything in the target region. In order to launch the instance successfully, you will need to replace these original objects with ones from the target region:

  • Image ID (AMI) – If you intend to recover the instance from a root device snapshot, you will not need a new image ID. If not (as in all cases with Windows and instance store-based instances), you will need to type a new image ID. If you use AMIs you prepared, you should also prepare them at your target regions and keep their IDs handy for when you need to recover. If needed, AMI Assistant can help you find a matching image (see section 9.2.3).
  • Key Pair – You should have a key pair created with AWS Management Console ready so you will not need to create it when you perform a recovery.
  • Security Groups – In a regular recovery, CPM will remember the security groups of the original instance and use them as default. In DR recovery, CPM cannot choose for you. You need to choose at least one, or the instance recovery screen will display an error. Security groups are objects you own, and you can easily create them in AWS Management Console. You should have them ready so you will not need to create them when you perform recovery.
  • Kernel ID – Linux instances need a kernel ID. If you are launching the instance from an image, you can leave this field empty, and CPM will use the kernel ID specified in the AMI. If you are recovering the instance from a root device snapshot, you need to find a matching kernel ID in the target region. If you do not do so, a default kernel will be used, and although the recovery operation will succeed and the instance will show as running in AWS Management Console, it will most likely not work. AMI Assistant can help you find a matching image in the target region (see section 9.2.3). When you find such an AMI, copy and paste its kernel ID from the AMI Assistant window.
  • RAMDisk ID – Many instances do not need a RAM disk at all and this field can be left empty. If you need it, you can use AMI Assistant the same way you do for Kernel ID. If you’re not sure, use the AMI Assistant or start a local recovery and see if there is a value in the RAMDisk ID field.
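
As a pre-recovery checklist, the two hard requirements described above (a new image ID when you are not recovering from a root device snapshot, and at least one security group) can be expressed as a small Python sketch. The class, field, and function names are hypothetical and are not part of CPM; the IDs in the example are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DrInstanceRecovery:
    """Hypothetical description of a planned instance recovery in a target region."""
    from_root_snapshot: bool
    image_id: Optional[str] = None
    security_groups: List[str] = field(default_factory=list)


def missing_items(plan: DrInstanceRecovery) -> List[str]:
    """List the target-region objects that still have to be supplied."""
    missing: List[str] = []
    # A new image ID is only needed when not recovering from a root device snapshot.
    if not plan.from_root_snapshot and not plan.image_id:
        missing.append("image ID from the target region")
    # In DR recovery, at least one security group must be chosen explicitly.
    if not plan.security_groups:
        missing.append("at least one security group from the target region")
    return missing


# Recovery from an image with nothing prepared yet: both items are missing.
print(missing_items(DrInstanceRecovery(from_root_snapshot=False)))

# Recovery from a root device snapshot with a security group already chosen.
print(missing_items(DrInstanceRecovery(from_root_snapshot=True, security_groups=["sg-placeholder"])))
```

Running a check like this during a recovery drill is one way to confirm that the objects you need already exist in the target region before a real outage.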

 

10.5.2 – DR of Encrypted Volumes, AMIs and RDS Instances

CPM supports DR of encrypted EBS volumes. If you are using KMS keys for encryption, CPM will look for a KMS key with the same alias in the target region.

 

To configure your cross-region DR:

Create a matching-alias key in the source and in the remote region for CPM to use automatically in the DR copy process.

  • If a matching key is not found in the target region, the DR process will fail.
  • If the volume is encrypted with the default encryption key, its snapshot is copied to the other region with the default encryption key as well.
  • CPM supports copy of AMIs with encrypted volumes with the same logic it uses for volumes.
  • CPM supports cross-region DR of encrypted RDS databases.

Note: To let CPM see keys and aliases, add these two permissions to the IAM policy that CPM is using: kms:ListKeys, kms:ListAliases.
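
A minimal sketch of an IAM policy statement that grants these two permissions is shown below, built as a Python dictionary and printed as JSON. The two action names come from the note above; the rest of the document layout follows the standard IAM policy format. In practice you would add this statement to the IAM policy CPM already uses rather than create a separate one, and the variable name is an assumption made for the example.

```python
import json

# Statement granting the two KMS list permissions named in the note above.
# Both actions are list operations, so the resource is "*".
kms_list_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:ListKeys", "kms:ListAliases"],
            "Resource": "*",
        }
    ],
}

print(json.dumps(kms_list_policy, indent=2))
```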

 

10.5.3 – A Complete Disaster Recovery Scenario

Let’s assume a real disaster recovery scenario: The region of your operation is completely down. It means that you do not have your instances or EBS volumes, and you do not have your CPM Server, as it is down with all the rest of your instances. Here is the disaster recovery process, step by step:

  1. With AWS Management Console:
    • Find the latest snapshot of your cpmdata policy by filtering snapshots with the string cpmdata. CPM always adds the policy name to any snapshot’s description.
    • Sort by Started in descending order and it will be the first one on the list.
    • Create a volume from this snapshot by right-clicking it and choosing Create Volume from Snapshot. You can give the new volume a name so it will be easy to find later.
  2. Launch a new CPM Server at the target region. You can use the Your Software page to launch the AWS Marketplace AMI. Wait until you see the instance in running state.
  3. As with regular configuration of a CPM server:
    • Connect to the newly created instance using HTTPS.
    • Approve the SSL certificate exception. Assuming the original instance still exists, CPM will come up in recovery mode, which means that the new server will perform recovery and not backup.
    • If you are running the BYOL edition and need an activation key, most likely you do not have a valid key at the time, and you do not want to wait until you can acquire one from N2W Software.
      • You can quickly register at CPM Basic Edition. In step 2 of the registration, use your own username and type any password. In step 3, choose the volume you just created for the CPM data volume. Afterwards, complete the configuration.
  4. With a working CPM server, you can perform any recovery you need at the target (current) region:
    • Select the backup you want to recover.
    • Click Recover.
    • Choose the target region from the drop-down list.
  5. You can recover all the backed-up objects that are available in the region.

Note: If your new server allows backup (it can happen if you registered to a different edition or if the original one is not accessible), it can start to perform backups. If that is not what you want, it is best to disable all policies before you start the recovery process.
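
The first step of the scenario above (filter snapshots by the string cpmdata and take the most recently started one) can be sketched in Python. The sketch works on a plain list of dictionaries whose keys mirror the SnapshotId, Description, and StartTime fields returned by the EC2 snapshot listing; the sample IDs and description texts are invented placeholders, and the case-insensitive match is an assumption.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def latest_cpmdata_snapshot(snapshots: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the most recently started snapshot whose description mentions cpmdata."""
    matches = [s for s in snapshots if "cpmdata" in str(s["Description"]).lower()]
    if not matches:
        return None
    # Equivalent to sorting by Started in descending order and taking the first row.
    return max(matches, key=lambda s: s["StartTime"])


# Placeholder data standing in for the snapshot list of the target region.
sample_snapshots = [
    {"SnapshotId": "snap-0001", "Description": "policy cpmdata (sample)", "StartTime": datetime(2013, 3, 1, 2, 0)},
    {"SnapshotId": "snap-0002", "Description": "policy sqlserver1 (sample)", "StartTime": datetime(2013, 3, 3, 2, 0)},
    {"SnapshotId": "snap-0003", "Description": "policy cpmdata (sample)", "StartTime": datetime(2013, 3, 2, 2, 0)},
]

latest = latest_cpmdata_snapshot(sample_snapshots)
print(latest["SnapshotId"] if latest else "no cpmdata snapshot found")
```

The snapshot selected this way is the one to create the new CPM data volume from, since the latest cpmdata backup is always the one to use.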

 

 

10.6 – DR Monitoring and Troubleshooting

DR is a straightforward process. If DR fails, it probably means that either a copy operation failed, which is not common, or the process timed out. You can track DR’s progress in the backup log, where every stage and operation during DR is recorded:


Figure 10‑3

 

You can also view DR snapshot IDs and statuses in the snapshots screen of the backup:


Figure 10‑4

 

Every DR snapshot is displayed with region information and the IDs of both the original and the copied snapshots.

 

If DR fails, you will not be able to use DR recovery. However, some of the snapshots may exist and be recoverable. You can see them in the snapshots screen and, if needed, you can recover from them manually.

 

If DR keeps failing because of timeouts, you may need to increase the timeout value for the relevant policy. The default of 24 hours should be enough, but copying a very large amount of data may take longer.

Reminder: You can only copy a limited number of snapshots to a given region at one time. Currently the number is 5. If the limit is reached, CPM waits for copy operations to finish before it starts more of them, which can affect the time it takes to complete the DR process.
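
To get a rough feel for how the copy limit interacts with the DR timeout, the following Python sketch estimates how many batches of copies are needed and whether they fit within the timeout. It is a deliberately crude model: it assumes that every copy takes the same amount of time and that copies run in full batches of five, neither of which is guaranteed in practice. The function names and the hours-per-copy figures are hypothetical.

```python
import math


def copy_batches(snapshot_count: int, concurrent_limit: int = 5) -> int:
    """Number of sequential batches needed when only `concurrent_limit` copies run at once."""
    return math.ceil(snapshot_count / concurrent_limit)


def fits_in_timeout(snapshot_count: int, hours_per_copy: float,
                    timeout_hours: float = 24, concurrent_limit: int = 5) -> bool:
    """Rough check: do all batches finish before the DR timeout expires?"""
    total_hours = copy_batches(snapshot_count, concurrent_limit) * hours_per_copy
    return total_hours <= timeout_hours


# Twelve snapshots need three batches of up to five copies each.
print(copy_batches(12))

# Under this model, six hours per copy fits the default 24-hour timeout,
# while nine hours per copy does not.
print(fits_in_timeout(12, hours_per_copy=6))
print(fits_in_timeout(12, hours_per_copy=9))
```

If an estimate like this comes out close to the timeout, consider increasing the DR Timeout value for the policy or reducing how often DR runs.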
