1. EC2 Disaster Recovery – Introduction
Many AWS users are not fully aware of the need to implement an EC2 disaster recovery solution. Isn’t AWS immune to malfunctions and disasters? The answer to that is no, and we can’t complain about it, since no infrastructure is.
EC2 is built for high availability of applications. The infrastructure is very robust, and many measures were taken to ensure this robustness. EC2 is divided into several regions, which are geographically dispersed around the globe and completely independent of each other. Each region is divided into availability zones, which are separate in their networking, power supply and other infrastructure. An outage can still occur, but because of this separation it is unlikely to span all availability zones of a region. There is, however, a small chance of an outage across an entire region. This is an extreme situation, which can happen if there is a region-wide malfunction, extreme weather, or, god forbid, a natural or man-made disaster. That is why you need to implement an EC2 disaster recovery solution.
2. EC2 Disaster Recovery – High Availability solutions
One way to face the possibility of a malfunction is an EC2 disaster recovery solution; another is a high availability solution. There are many types of high availability solutions, but the basic idea is replicating the data to another location and keeping hot standby systems ready to take over the role of the original systems when an outage or other failure occurs. Most databases support various “master-slave” configurations and replication schemes that achieve this goal. AWS has services that implement such configurations out of the box, like RDS, whose multi-AZ functionality replicates the data between availability zones of the same region. If you need a high availability solution that supports replication across regions, you can implement it on your own. High availability solutions are expensive by nature and cannot replace a disaster recovery solution, because replication does not protect against logical data loss: an accidental deletion or corruption is replicated to the standby along with everything else.
3. EC2 Disaster Recovery Based on EBS Snapshots
Cloud Protection Manager (CPM) offers an EC2 backup solution and EC2 disaster recovery solution based on EBS snapshots. Snapshot technology allows you to take snapshots in a way that hardly affects the host, while being very efficient in storage space. Furthermore, snapshots can be used for rapid recovery, since you can create volumes from snapshots almost instantly.
Snapshots are an ideal solution for backup in case of an outage since they are saved in S3, an infrastructure separate from EC2, and can be restored in any availability zone of the region the volumes are in. With CPM you can recover your instances and data in another availability zone within minutes.
A few months ago, AWS added the ability to copy snapshots between regions. This opened the possibility of using EBS snapshots for an EC2 disaster recovery solution that spans EC2 regions. There are a few caveats with copying EBS snapshots. Although EBS snapshots are incremental, which means that only the disk blocks that changed since the last snapshot of a volume are stored, snapshots are copied as a whole. This means that, in terms of both data transfer and storage space at the remote region, you have to copy and store all of the volume’s data for every snapshot.
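For readers who script against AWS directly, a cross-region copy can be sketched roughly as follows. This is only an illustration and not how CPM performs the copy. It assumes the boto3 library, whose EC2 client exposes `copy_snapshot`; the helper name `copy_snapshot_to_region`, the regions and the snapshot ID are hypothetical. Note that the request is issued against the destination region.

```python
def copy_snapshot_to_region(dest_ec2_client, source_region, snapshot_id, description=""):
    """Start a cross-region snapshot copy and return the ID of the new snapshot.

    dest_ec2_client is assumed to be an EC2 client bound to the DESTINATION
    region, for example boto3.client("ec2", region_name="us-west-2").
    """
    response = dest_ec2_client.copy_snapshot(
        SourceRegion=source_region,      # region that holds the original snapshot
        SourceSnapshotId=snapshot_id,    # snapshot to copy
        Description=description,
    )
    # The call returns immediately; the copy itself completes in the background.
    return response["SnapshotId"]


# Example usage (needs boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   dest = boto3.client("ec2", region_name="us-west-2")
#   new_id = copy_snapshot_to_region(dest, "us-east-1", "snap-0123456789abcdef0")
```

The helper takes the client as a parameter so that it can be exercised with a stub object without touching a real AWS account.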
So, let’s assume a 1TB EBS volume which is 80% full (about 800GiB of data) and has a daily change of 25GiB. For local snapshots, you store 800GiB for the first full snapshot, and after that it costs you an extra 25GiB of storage daily to keep a current copy of the volume. But if you want to copy a daily snapshot to another region, you pay for copying 800GiB or more a day, plus storing 800GiB for every snapshot you keep at the remote region. Compression reduces this amount, but not necessarily by much. This can become very expensive. Another thing to consider is that copying a large amount of data over the WAN can take many hours. So if you implement a backup solution that takes a snapshot every couple of hours, it will not be feasible to copy each one to another region, even if you are willing to pay the costs.
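The arithmetic above can be worked through in a few lines. This sketch uses the figures from the example (800GiB used, 25GiB daily change) and ignores compression. The seven-day retention and the sustained 200 Mbit/s cross-region throughput are assumptions chosen for illustration, not measured values.

```python
GIB = 1024 ** 3  # bytes in one GiB


def local_storage_gib(used_gib, daily_change_gib, days_kept):
    """Incremental local snapshots: one full baseline plus one delta per extra day."""
    return used_gib + daily_change_gib * (days_kept - 1)


def remote_storage_gib_full_copies(used_gib, copies_kept):
    """Full remote copies: every retained copy stores all used blocks."""
    return used_gib * copies_kept


def transfer_hours(size_gib, throughput_mbit_per_s):
    """Hours needed to move size_gib over a link sustaining the given throughput."""
    bits = size_gib * GIB * 8
    seconds = bits / (throughput_mbit_per_s * 1_000_000)
    return seconds / 3600


used, change, days = 800, 25, 7  # days of retention is an assumed value

print(local_storage_gib(used, change, days))       # 950
print(remote_storage_gib_full_copies(used, days))  # 5600
print(round(transfer_hours(used, 200), 1))         # 9.5 (at an assumed 200 Mbit/s)
```

Under these assumptions, a week of daily full copies needs almost six times the storage of the local snapshots, and each copy occupies the link for most of a working day.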
4. EC2 Disaster Recovery – CPM’s Approach to Disaster Recovery
With Cloud Protection Manager (CPM), you have complete control over all aspects of your EC2 backup. You configure backup policies for your local backup, including retention windows, detailed scheduling and more. This, plus the powerful recovery options, which allow you to recover complete instances with a mouse click, adds up to a powerful solution for backup and recovery across availability zones.
Now, when you need an EC2 disaster recovery solution that works across regions (and can withstand a region-wide outage), you need to weigh costs and technical limitations against your business needs and answer these two questions:
- a. What is the recovery point objective I need for my local backup?
- b. What is the recovery point objective I need for my remote (copied) backups?
Recovery point objective (RPO) is the amount of data you will lose when recovering, because the last backup is not an up-to-date copy of the data. For a daily backup the RPO can be up to a day. This means that when you recover, your data will be a day old in the worst case. Is that good enough for your application? CPM allows you to perform backups at a much higher frequency than daily, but you are the one weighing needs against cost, and you decide. So why should there be a different RPO for remote regions? There are two reasons: the limitations and costs I discussed above, and the fact that the probability of a region-wide outage is much lower than that of an outage in a single availability zone. You want to be prepared for such an outage, but do you need to be as prepared as for a much more probable mishap?
CPM allows you to set, for example, bi-hourly local backups while copying to another region only once a day. Because you can configure different frequencies, you can balance your needs against costs and against the probability of an outage.
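As a rough illustration of how different frequencies translate into different recovery point objectives, the sketch below computes the worst-case RPO for a local and a remote schedule. It assumes that "bi-hourly" means every two hours and that a daily cross-region copy takes 10 hours to complete. Both numbers are assumptions for the example, not CPM defaults, and `worst_case_rpo` is a hypothetical helper.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval, copy_duration=timedelta(0)):
    """Worst-case age of the newest usable backup.

    A failure just before the next backup loses a full interval of data.
    For remote copies, the newest snapshot is unusable until its copy
    completes, so the copy duration is added on top.
    """
    return backup_interval + copy_duration


local_rpo = worst_case_rpo(timedelta(hours=2))
remote_rpo = worst_case_rpo(timedelta(days=1), copy_duration=timedelta(hours=10))

print(local_rpo)   # 2:00:00
print(remote_rpo)  # 1 day, 10:00:00
```

The gap between the two results is the price of the cheaper remote schedule, and it is the number to check against your business needs.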
5. EC2 Disaster Recovery – Recovery in a Remote Region
On top of the actual backup and copying of snapshots, you need to be able to perform recovery using the CPM Server in case of an outage. To achieve this, you need to ensure that the CPM Server (which is an EC2 instance) will be up and running during such an outage. You can do so by placing the CPM Server away from your other instances and volumes. You will surely not want the CPM Server to be in the same availability zone, but since CPM works seamlessly across regions, perhaps the best place to put it is the remote region. Another approach that CPM supports is to back up the CPM Server data volume itself and copy it to the remote region as well. In that case there is a procedure to quickly launch a new CPM Server at the remote region and use it to recover your data.
There are challenges when recovering instances at a remote region. An instance’s configuration includes dependencies on other objects like Key Pairs, Security Groups and, more importantly, Kernels and Ram Disks. These objects exist in the context of the local region, and not in the remote region where you want to recover the instance. Key Pairs and Security Groups are objects you define, so it’s easy enough to define them in other regions as well. As for Kernels and Ram Disks, you need to find compatible ones at the remote region. CPM helps you with that by trying to find AMIs at the remote region that are similar to the original AMI that was used to launch the instance in the original region. When such an AMI is found, you can use its configuration to launch the recovered instance.
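The dependency check described above can be sketched on plain dictionaries. This is a hypothetical illustration and not CPM’s actual matching logic. The field names (`key_name`, `security_groups`, `architecture`, `virtualization_type`, `name`) and the idea of treating AMIs with the same architecture and virtualization type as "similar" are assumptions made for the example.

```python
def missing_dependencies(instance, remote_key_pairs, remote_security_groups):
    """Return the key pair and security group names the instance needs
    but that do not exist yet in the remote region."""
    missing_keys = {instance["key_name"]} - set(remote_key_pairs)
    missing_groups = set(instance["security_groups"]) - set(remote_security_groups)
    return sorted(missing_keys), sorted(missing_groups)


def similar_amis(original_ami, remote_amis):
    """Keep remote AMIs whose architecture and virtualization type match the
    original, listing exact name matches first."""
    candidates = [
        ami for ami in remote_amis
        if ami["architecture"] == original_ami["architecture"]
        and ami["virtualization_type"] == original_ami["virtualization_type"]
    ]
    # False sorts before True, so AMIs with the same name come first.
    return sorted(candidates, key=lambda ami: ami["name"] != original_ami["name"])


instance = {"key_name": "prod-key", "security_groups": ["web", "db"]}
print(missing_dependencies(instance, ["prod-key"], ["web"]))  # ([], ['db'])
```

Anything reported as missing has to be defined in the remote region before the recovered instance can be launched with its original configuration.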
6. EC2 Disaster Recovery – Conclusion
Cloud Protection Manager (CPM) offers a comprehensive solution for both EC2 Backup and EC2 Disaster Recovery. When weighing the costs and business needs of your disaster recovery solution, you can take into account the low probability of a region-wide outage and accept a higher recovery point objective for it than for an outage within a region. If your business can’t tolerate such an RPO, you will probably need to implement a high availability solution in addition to EC2 disaster recovery.
UPDATE (June 13, 2013): Yesterday AWS announced that from now on, EBS snapshot copies will be incremental. This means that every copy operation will only copy blocks that were changed since the last snapshot (of the same volume) that was copied. With this change, it will be a lot easier to make copies much more frequently, maybe even at the same frequency as the snapshots taken in the original region. Still, as copying a snapshot is a more expensive operation than taking it locally, you need to plan and weigh cost against business needs.
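To get a feel for the difference this makes, the sketch below compares the data sent to the remote region over a month of daily copies, using the 800GiB volume with 25GiB of daily change from the earlier example. The 30-day window, and the simplification that each day changes exactly 25GiB of blocks, are assumptions for illustration.

```python
def monthly_transfer_gib(used_gib, daily_change_gib, days=30, incremental=True):
    """Data sent to the remote region over `days` daily copies."""
    if incremental:
        # The first copy is full; each later copy sends only the changed blocks.
        return used_gib + daily_change_gib * (days - 1)
    # Full copies send all used blocks every day.
    return used_gib * days


full = monthly_transfer_gib(800, 25, incremental=False)
incr = monthly_transfer_gib(800, 25, incremental=True)

print(full, incr, round(full / incr, 1))  # 24000 1525 15.7
```

Under these assumptions, incremental copies cut the monthly transfer by a factor of more than fifteen.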