AWS is a scalable and high-performance computing infrastructure used by many organizations to modernize their IT. However, no system is invulnerable, and if you want to ensure business continuity, you need to have some kind of insurance in place. Disaster events do occur, whether they are a malicious hack, natural disaster, or even an outage due to a hardware failure or human error. And while AWS is designed in such a way to greatly offset some of these events, you still need to have a proper disaster recovery plan in place. Here are 10 tips you should consider when building a DR plan for your AWS environment.
1. Ship Your EBS Volumes to Another AZ/Region
By default EBS volumes are automatically replicated within an Availability Zone (AZ) where they have been created, in order to increase durability and offer high availability. And while this is a great way to protect your data from relying on a lone copy, you are still tied to a single point of failure, since your data is located only in one AZ. In order to properly secure your data, you can either replicate your EBS volumes to another AZ, or even better, to another region.
To copy EBS volumes to another AZ, you simply create a snapshot of it, and then recreate a volume in the desired AZ from that snapshot. And if you want to move a copy of your data to another region, take a snapshot of your EBS, and then utilize the “copy” option and pick a region where your data will be replicated.
2. Utilize Multi-AZ for EC2 and RDS
Just like your EBS volumes, your other AWS resources are susceptible to local failures. Making sure you are not relying on a single AZ is probably the first step you can take when setting up your infrastructure. For your database needs covered by RDS, there is a Multi-AZ option you can enable in order to create a backup RDS instance, which will be used in case the primary one fails by switching the CNAME DNS record of your primary RDS instance. Keep in mind that this will generate additional costs, as AWS charges you double if you want a multi-AZ RDS setup compared to having a single RDS instance.
Your EC2 instances should also be spread across more than one AZ, especially the ones running your production workloads, to make sure you are not seriously affected if a disaster happens. Another reason to utilize multiple AZs with your EC2 instances is the potential lack of available resources in a given AZ, which can occur sometimes. To properly spread your instances, make sure AutoScaling Groups (ASG) are used, along with an Elastic Load Balancer (ELB) in front of them. ASG will allow you to choose multiple AZs in which your instances will be deployed, and ELB will distribute the traffic between them in order to properly balance the workload. If there is a failure in one of the AZs, ELB will forward the traffic to others, therefore preventing any disruptions.
With EC2 instances, you can even go across regions, in which case you would have to utilize Route53 (a highly available and scalable cloud DNS service) to route the traffic, as well as do the load balancing between regions.
3. Sync Your S3 Data to Another Region
When we consider storing data on AWS, S3 is probably the most commonly used service. That is the reason why, by default, S3 duplicates your data behind the scene to multiple locations within a region. This creates high durability, but data is still vulnerable if your region is affected by a disaster event. For example, there was a full regional S3 outage back in 2017 (which actually hit a couple other services as well), which led to many companies being unable to access their data for almost 13 hours. This is a great (and painful) example of why you need a disaster recovery plan in place.
In order to protect your data, or just provide even higher durability and availability, you can use the cross-region replication option which allows you to have your data copied to a designated bucket in another region automatically. To get started, go to your S3 console and enable cross-region replication (versioning must be enabled for this to work). You will be able to pick the source bucket and prefix but will also have to create an IAM role so that your S3 can get objects from the source bucket and initiate transfer. You can even set up replication between different AWS accounts if necessary.
Do note though that the cross-region sync starts from the moment you enable it, so any data that already exists in the bucket prior to this will have to be synced by hand.
4. Use Cross-Region Replication for Your DynamoDB Data
Just like your data residing in S3, DynamoDB only replicates data within a region. For those who want to have a copy of their data in another region, or even support for multi-master writes, DynamoDB global tables should be used. These provide a managed solution that deploys a multi-region multi-master database and propagates changes to various tables for you. Global tables are not only great for disaster recovery scenarios but are also very useful for delivering data to your customers worldwide.
Another option would be to use scheduled (or one-time) jobs which rely on EMR to back up your DynamoDB tables to S3, which can be later used to restore them to, not only another region, but also another account if needed. You can find out more about it here.
5. Safely Store Away Your AWS Root Credentials
It is extremely important to understand the basics around security on AWS, especially if you are the owner of the account or the company. AWS root credentials should ONLY be used to create initial users with admin privileges which would take over from there. Root password should be stored away safely, and programmatic keys (Access Key ID and Secret Access Key) should be disabled if already created.
Somebody getting access to your admin keys would be very bad, especially if they have malicious intentions (disgruntled employee, rival company, etc.), but getting your root credentials would be even worse. If a hack like this happens, your root user is the one you would use to recover, whether to disable all other affected users, or contact AWS for help. So, one of the things you should definitely consider is protecting your account with multi-factor authentication (MFA), preferably a hardware version.
The advice to protect your credentials sometimes sounds like a broken record, but many don’t understand the actual severity of this, and companies have gone out of business because of this oversight.
6. Define your RTO and RPO
Recovery Time Objective (RTO) represents the allowed time it would take to restore a process back to service, after a disaster event occurs. If you guarantee an RTO of 30 minutes to your clients, it means that if your service goes down at 5 p.m., your recovery process should have everything up and running again within half an hour. RTO is important to help determine the disaster recovery strategy. If your RTO is 15 minutes or less, it means that you potentially don’t have time to reprovision your entire infrastructure from scratch. Instead, you would have some instances up and running in another region, ready to take over.
When looking at recovering data from backups, RTO defines what AWS services can be used as a part of disaster recovery. For example if your RTO is 8 hours, you will be able to utilize Glacier as a backup storage, knowing that you can retrieve the data within 3–5 hours using standard retrieval. If your RTO is 1 hour, you can still opt for Glacier, but expedited retrieval costs more, so you might chose to keep your backups in S3 standard storage instead.
Recovery Point Objective (RPO) defines the acceptable amount of data loss measured in time, prior to a disaster event happening. If your RPO is 2 hours, and your system has gone down at 3 p.m., you must be able to recover all the data up until 1 p.m. The loss of data from 1 p.m. to 3 p.m. is acceptable in this case. RPO determines how often you have to take backups, and in some cases continuous replication of data might be necessary.
7. Pick the Correct DR Scenario for Your Use Case
The AWS Disaster Recovery white paper goes to great lengths to describe various aspects of DR on AWS, and does a good job of covering four basic scenarios (Backup and Restore, Pilot Light, Warm Standby and Multi Site) in detail. When creating a plan for DR, it is important to understand your requirements, but also what each scenario can provide for you. Your needs are also closely related to your RTO and RPO, as those determine which options are viable for your use case.
These DR plans can be very cheap (if you rely on simple backups only for example), or very costly (multi-site effectively doubles your cost), so make sure you have considered everything before making the choice.
8. Identify Mission Critical Apps and Data and Design Your DR Strategy Around Them
While all your applications and data might be important to you or your company, not all of them are critical for running a business. In most cases not all apps and data are treated equally, due to the additional cost it would create. Some things have to take priority, both when making a DR plan, and when restoring your environments after a disaster event. An improper prioritization will either cost you money, or simply risk your business continuity.
9. Test your Disaster Recovery
Disaster Recovery is more than just a plan to follow in case something goes wrong. It is a solution that has to be reliable, so make sure it is up to the task. Test your entire DR process thoroughly and regularly. If there are any issues, or room for improvement, give it the highest possible priority. Also don’t forget to focus on your technical people as well as—they too need to be up to the task. Have procedures in place to familiarize them with every piece of the DR process.
10. Consider Utilizing 3rd-party DR Tools
AWS provides a lot of services, and while many companies won’t ever use the majority of them, for most use cases you are being provided with options. But having options doesn’t mean that you have to solely rely on AWS. Instead, you can consider using some 3rd-party tools available in AWS Marketplace, whether for disaster recovery or something else entirely. N2WS Backup & Recovery is the top-rated backup and DR solution for AWS that creates efficient backups and meets aggressive recovery point and recovery time objectives with lower TCO. Starting with the latest version, N2WS Backup & Recovery offers the ability to move snapshots to S3 and store them in a Veeam repository format. This new feature enables organizations to achieve significant cost-savings and a more flexible approach toward data storage and retention. Learn more about this here.
Disaster recovery planning should be taken very seriously, nonetheless many companies don’t invest enough time and effort to properly protect themselves, leaving their data vulnerable. And while people will often learn from their mistakes, it is much better to not make them in the first place. Make disaster recovery planning a priority and consider the tips we have covered here, but also do further research.