5 Guidelines for Recovery Drills within the AWS Cloud

In this post, we will explore the ins and outs of recovery drills: why they are necessary, how to perform them, as well as how recovery, in general, has changed over the years.

Running recovery drills is an integral part of a complete and reliable backup solution. Being able to back up with minimal downtime and to recover data quickly is crucial when the need arises. However, we often see customers who run backup routines for their data without ever testing a complete recovery of their online services, applications, and data. Until a complete recovery has actually been attempted, there is always a chance that it will run into problems.

Recovery drills are a vital part of the recovery plan of small companies and enterprises alike in case of data loss and outages. They also help meet strict compliance and regulatory standards that require periodic reports on system performance, security, and availability. By leveraging the AWS cloud’s flexibility, you can build an efficient recovery mechanism and test it easily from time to time.

Below, we will discuss 5 general guidelines that cover why and how recovery drills should be performed within the AWS cloud, especially when it comes to EC2.

5 Guidelines for Your Cloud Recovery Drill

1 – Routine Drills

Make sure you perform recovery drills on a regular basis. Because environments change (e.g. software upgrades, expansion), you can’t guarantee that a drill that succeeds now will still work a year from now. Every company has different needs, so the frequency should be tailored to your environment. In addition, every recovery drill needs to run to completion: if you fully recover a critical production environment, also verify that the recovered servers function properly and that the data is consistent and not corrupt.
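One way to keep drills from slipping is to track when each environment was last drilled and flag the ones that are overdue. The following is a minimal sketch; the environment names and intervals are placeholder assumptions, not recommendations.

```python
from datetime import date, timedelta

# Hypothetical drill intervals per environment, in days. These values are
# placeholders; the right frequency depends on how often your environment changes.
DRILL_INTERVAL_DAYS = {"production": 30, "staging": 90}


def is_drill_overdue(environment: str, last_drill: date, today: date) -> bool:
    """Return True when the last completed drill is older than the interval."""
    interval = timedelta(days=DRILL_INTERVAL_DAYS[environment])
    return today - last_drill > interval


if __name__ == "__main__":
    print(is_drill_overdue("production", date(2024, 1, 1), date(2024, 3, 1)))
```

Only count a drill as "last completed" if it ran all the way through, including the checks on the recovered servers and data.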

2 – Document the Recovery Drill

Recording each recovery drill in detail will help the process run smoothly. For example, if you recover an Oracle EC2 instance whose database was switched to backup mode during the backup, you have to return the database to normal operation once the data is recovered so that the server runs correctly. Everything that you want to recover (e.g. EC2 instances, security groups, AMIs) needs to be organized in well-structured, easy-to-read documents so you can recover what you need when you need it.
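A structured record can be as simple as a small data class serialized to JSON. The sketch below is illustrative only: the `DrillRecord` class, its field names, and the resource IDs are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DrillRecord:
    """One recovery drill, written down so it can be repeated later."""
    drill_date: str
    resources: list  # e.g. instance IDs, security groups, AMIs
    steps: list = field(default_factory=list)
    post_recovery_actions: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# All identifiers below are placeholders, not real resources.
record = DrillRecord(
    drill_date="2024-03-01",
    resources=["i-0123456789abcdef0", "sg-0123456789abcdef0"],
    steps=["Launch instance from latest AMI", "Attach recovered volumes"],
    post_recovery_actions=["Return the Oracle database to normal operation"],
)
print(record.to_json())
```

Keeping post-recovery actions as their own field makes steps like the Oracle example above hard to overlook.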

3 – Check Data Consistency

After you have recovered data, you need to check that it is consistent in order to fully complete the process. Running file system tools such as chkdsk on Windows or fsck on Linux helps ensure proper access to the data after recovery. While disk checking utilities may be able to repair inconsistent data after recovery, we do not recommend relying on them for that. A better approach is to retrace your steps and find out why the backup was inconsistent in the first place. Some utilities check specific applications and databases for discrepancies (e.g. Eseutil for Microsoft Exchange and mysqlcheck for MySQL).
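On Linux, a drill script can run fsck in check-only mode and decode its exit status, which is a bitwise OR of the conditions listed in the fsck(8) manual page. The sketch below assumes the recovered volume is unmounted and that the underlying checker honors `-n` (e2fsck does); the function names are hypothetical.

```python
import subprocess

# Exit-status bits documented in the fsck(8) manual page.
FSCK_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def describe_fsck_status(code: int) -> list:
    """Decode fsck's exit status; an empty list means no errors were reported."""
    return [message for bit, message in FSCK_BITS.items() if code & bit]


def check_volume(device: str) -> list:
    """Run a check without repairs (-n) on a device path supplied by the caller."""
    result = subprocess.run(["fsck", "-n", device], capture_output=True, text=True)
    return describe_fsck_status(result.returncode)
```

Running without repairs fits the advice above: the goal of the drill is to learn that a backup was inconsistent, not to patch it up silently.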

In a typical data loss scenario, the latest backup is the most important. However, if that backup does not contain accurate data, you want to rely on a backup that you know is dependable. For example, if logical data loss was the result of a particular event (e.g. human error or an attack), you would want to revert to the last backup taken before the event, not necessarily the most recent one. Recovery drills confirm that whatever backup you end up using is valid or, if it is not, help you figure out what other actions need to be taken.
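Choosing the backup taken just before an event is easy to get wrong under pressure, so it is worth scripting. A minimal sketch, assuming you already have a mapping of backup IDs to creation times (the IDs and timestamps here are made up):

```python
from datetime import datetime
from typing import Optional


def latest_backup_before(backups: dict, event_time: datetime) -> Optional[str]:
    """Return the ID of the newest backup taken strictly before event_time."""
    candidates = {bid: ts for bid, ts in backups.items() if ts < event_time}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Placeholder backup IDs with nightly timestamps.
backups = {
    "snap-a": datetime(2024, 3, 1, 2, 0),
    "snap-b": datetime(2024, 3, 2, 2, 0),
    "snap-c": datetime(2024, 3, 3, 2, 0),
}
incident = datetime(2024, 3, 2, 15, 30)
print(latest_backup_before(backups, incident))  # snap-b
```

Here the most recent backup (`snap-c`) was taken after the incident, so it is skipped in favor of `snap-b`.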

4 – Drill Implementation

The right kind of preparation will ease your recovery process. This includes precisely defining the recovery process for each individual component as well as the recovery characteristics of the overall environment (such as its location). The following examples can help you plan and implement recovery in different locations.

  1. If your production instance crashes and needs to be recovered, you should first understand the state of the crashed instance (i.e. whether it was damaged or whether it still exists at all). If the instance was terminated, you will be recovering to a new instance rather than restoring the old one with the same ID and host name.
  2. Recover an EC2 instance without replacing the old one. Essentially, you recover it alongside the old one in order to copy data or run some data validation tests (without switching out the entire instance).
  3. Recover an instance to another subnet that is separate from your production’s main subnet. Prepare the subnet ahead of time, or at least write down exactly how you want it defined once it is needed. That way the newly recovered instance doesn’t affect your main environment.
  4. If you are hesitant to perform recovery on your production environment, you can prepare a sandbox environment. Ensure that the sandbox closely resembles your real-life environment so that drills performed there are representative.
  5. When recovering EC2 instances or EBS volumes to another region, you have to make sure your backup processes copy the related AMIs and snapshots to the recovery location. At times, deploying your application in a different region requires additional work during recovery. For example, with Linux servers, you may need to obtain a valid kernel ID. Because kernel IDs are specific to their region, you cannot use the kernel ID from your original instance on a recovered instance located in a different region, or it will not function properly. Preparing an adequate environment in another region requires paying attention to, and updating, resources such as VPCs, security groups, and AMIs.
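The cross-region case in the last example can be sketched with the `copy_image` and `copy_snapshot` calls of the boto3 EC2 client. This is a sketch under assumptions: `ec2_dest` is taken to be a client created for the recovery region (for example with `boto3.client("ec2", region_name=...)`), and the function name and naming scheme are hypothetical.

```python
def copy_backups_to_region(ec2_dest, source_region, ami_ids, snapshot_ids):
    """Copy AMIs and EBS snapshots into the region that ec2_dest points at.

    ec2_dest is assumed to be a boto3 EC2 client for the recovery region.
    Returns a mapping from each source ID to the ID of its copy.
    """
    copied = {"images": {}, "snapshots": {}}
    for ami_id in ami_ids:
        response = ec2_dest.copy_image(
            Name=f"drill-copy-{ami_id}",  # hypothetical naming scheme
            SourceImageId=ami_id,
            SourceRegion=source_region,
        )
        copied["images"][ami_id] = response["ImageId"]
    for snapshot_id in snapshot_ids:
        response = ec2_dest.copy_snapshot(
            SourceSnapshotId=snapshot_id,
            SourceRegion=source_region,
            Description=f"Recovery drill copy of {snapshot_id}",
        )
        copied["snapshots"][snapshot_id] = response["SnapshotId"]
    return copied
```

The copies get new IDs in the destination region, so record the returned mapping in your drill documentation rather than reusing the source IDs.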

5 – Manual vs. Automation

Even if many processes are automated, performing a recovery drill will guard your environment and enhance your availability. The challenge with automating everything in your environment is that while tasks seem to be carried out effortlessly, you may find out after the fact that half of them were not actually completed. Tasks can be automated in a script; the real challenge is making sure the script works properly.
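One way to catch tasks that appear to finish but accomplish nothing is to pair every automated action with an independent check. The following is a minimal sketch with hypothetical step names, not a full orchestration framework.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with an independent check that the action worked.
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_and_verify(steps: List[Step]) -> List[str]:
    """Run each step in order and return the names of steps that failed."""
    failed = []
    for name, action, check in steps:
        try:
            action()
        except Exception:
            failed.append(name)
            continue
        if not check():
            failed.append(name)
    return failed


state = {"volume_attached": False}
steps = [
    ("attach volume",
     lambda: state.update(volume_attached=True),
     lambda: state["volume_attached"]),
    # This action raises no error, yet its check shows nothing was done.
    ("start service", lambda: None, lambda: False),
]
print(run_and_verify(steps))  # ['start service']
```

The second step is the case described above: the script reports no error, but the verification shows the task was never really completed.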

Then and Now – The Traditional World vs. the Cloud

Recovery using traditional methods requires a certain level of hardware preparation, which does not suit real-world problems that need immediate action. At times, a replication-based recovery process would require you to modify a standby server so that the replicated data is ready when recovery is needed, which is quite tricky. Conversely, in the world of the cloud and virtualization, servers are readily available, saving you precious time and money. All you need is the ability to launch a new virtual machine or EC2 instance, and you can recover all of your volumes (i.e. your whole environment).
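To illustrate how little preparation a cloud launch needs, the sketch below starts one instance from a backup AMI with the boto3 EC2 client's `run_instances` call. It assumes `ec2` is a boto3 EC2 client and that the AMI and subnet already exist; the function name and the default instance type are hypothetical choices.

```python
def launch_recovered_instance(ec2, ami_id, subnet_id, instance_type="t3.micro"):
    """Launch one instance from a backup AMI and return its instance ID.

    ec2 is assumed to be a boto3 EC2 client; subnet_id can point at an
    isolated subnet so the drill does not touch production.
    """
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        SubnetId=subnet_id,
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]
```

Passing a drill-only subnet here keeps the recovered instance away from your main environment.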

Final Words

It is vital to the survival of your business to know how to properly recover data. Luckily, leveraging the cloud for recovery saves you the hassle of preparing hardware and other resources that you would otherwise need for traditional recovery processes. The 5 guidelines listed above detail the importance of recovery drills within the AWS cloud. Even though the cloud makes recovery easier, it is still necessary to make sure it’s done right. With these drills, you can greatly reduce the risk of data loss.
