In the age of the cloud, we not only enjoy managing our resources flexibly and without the need for capital investment; we also benefit from the fact that prominent cloud providers (e.g., AWS) have excellent infrastructure, which gives us excellent durability for data and compute resources.
1. Do you really need to back up your EC2?
Not surprisingly, the answer to this rhetorical question is yes, assuming your instances contain real data, for the following reasons:
- Cloud Outages: AWS keeps your virtual server disks (EBS volumes) on some sort of storage array. You can be pretty sure that the AWS team uses enough redundancy to minimize the chance of failure resulting in data loss. However, wouldn’t it be safer to know you also have a recent and consistent copy of your data elsewhere? No system is 100% bulletproof.
- Continuity: Another important point is that even if outages end with all the data intact, there is still downtime to consider. EC2 availability zones are on different infrastructures: different buildings, power supplies, communication infrastructure, and so on. This means that when an outage occurs in one availability zone, there’s a good chance that other zones will keep working. If you have the means to quickly recover your instances in another availability zone, you can minimize downtime. In case of a larger-scale outage, where a whole region may be down, having the ability to recover your instances in another region will minimize risk.
- Human Errors: Regardless of the durability of your volumes, you may experience data loss because of a human error. Someone in your organization (e.g., an IT person during an update installation) can lose data by mistake. It can be data in one of the volumes, or even a complete system crash. For those cases you will want to have a recent and consistent backup of your data, in order to recover your instance quickly.
- Software Malfunctions: Every software product has bugs. This is because humans (if you view software developers as humans) write the code. Although most bugs will not result in data loss, especially in well-known and mature products, some do. Again, if something of this sort happens, you will want a recent and consistent backup of your data.
- Malicious Attacks: We all know about malicious attacks. A lot of EC2 users use the cloud as a platform for web and game hosting. Web-facing applications are more vulnerable to attacks than internal applications. In case of an attack that causes data loss, it is very important to have a recent and consistent backup in order to recover from the attack as quickly as possible.
- Regulation Compliance and Long-Term Archiving: In traditional IT structures, the operational backup is usually separate from data archiving. Today, operational backup is typically done with a disk-to-disk solution based on snapshots, which ensures fast recovery. Archived data, which doesn’t require quick recovery but does need cost-effective long-term storage, is usually saved to tape. Another solution is a hierarchical storage device that automatically moves data to tape over time. In the AWS environment, the most logical solution is to archive files into S3 and Glacier. As a new feature, AWS lets you define lifecycle rules for objects in S3, so that the objects are automatically migrated to Glacier over time, giving you the capabilities of a hierarchical storage device.
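To make the archiving point concrete, here is a minimal sketch of what an S3-to-Glacier lifecycle rule can look like. The helper name `glacier_lifecycle_rule`, the `archive/` prefix, and the day counts are illustrative assumptions, not values from this article. The dictionary follows the shape that S3 lifecycle configurations use, but treat it as a starting point rather than a finished policy.

```python
import json


def glacier_lifecycle_rule(prefix, days_until_glacier, days_until_expiry=None):
    """Build one S3 lifecycle rule that moves objects under `prefix` to Glacier.

    Hypothetical helper for illustration; all arguments are assumptions.
    """
    rule = {
        "ID": "to-glacier-" + (prefix.strip("/") or "all"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # After this many days, S3 transitions the object to the Glacier storage class.
        "Transitions": [{"Days": days_until_glacier, "StorageClass": "GLACIER"}],
    }
    if days_until_expiry is not None:
        # Deleting an object before it is ever archived would defeat the purpose.
        if days_until_expiry <= days_until_glacier:
            raise ValueError("expiry must come after the Glacier transition")
        rule["Expiration"] = {"Days": days_until_expiry}
    return rule


# Assumed policy: archive after 30 days, delete after roughly seven years.
lifecycle_configuration = {"Rules": [glacier_lifecycle_rule("archive/", 30, 2555)]}
print(json.dumps(lifecycle_configuration, indent=2))
```

If you use boto3, a dictionary like this can be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call, along with your own bucket name.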
Keep in mind: Backup is a type of insurance policy. You are fine without one until you need it, and then it can be too late.
2. What are your EC2 backup policies and objectives?
The answer is actually quite simple. Our traditional disaster recovery (DR) experience serves us well here. Your policy and objectives should be the same as those you had for the same data when it was managed in your traditional data center.
RTO & RPO
Above all, your business needs dictate your backup objectives. Usually, we aim to minimize the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). Minimal RPO means that if we need to recover our data, we want the data to be as recent as possible, which by extension means that we need the backups to be frequent enough. Minimal RTO means that when we need to actually recover the data, our downtime will be minimal, i.e., the recovery process needs to be fast enough.
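The relationship between backup frequency, recovery speed, and the two objectives can be sketched in a few lines. This is only an illustration: the `BackupPlan` class, the plan names, and all durations below are assumed numbers, and the worst-case data loss is approximated as one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class BackupPlan:
    """Hypothetical description of a backup schedule (illustrative only)."""
    name: str
    backup_interval: timedelta   # time between two consecutive backups
    restore_duration: timedelta  # estimated time to bring the data back

    def worst_case_data_loss(self):
        # A failure just before the next backup loses almost a full interval of data.
        return self.backup_interval

    def meets(self, rpo, rto):
        # The plan is acceptable only if both objectives are satisfied.
        return self.worst_case_data_loss() <= rpo and self.restore_duration <= rto


# Assumed business objectives: lose at most 1 hour of data, be back within 30 minutes.
rpo = timedelta(hours=1)
rto = timedelta(minutes=30)

plans = [
    BackupPlan("weekly full image", timedelta(weeks=1), timedelta(hours=2)),
    BackupPlan("hourly snapshot", timedelta(hours=1), timedelta(minutes=15)),
]

for plan in plans:
    verdict = "meets" if plan.meets(rpo, rto) else "misses"
    print(f"{plan.name}: {verdict} the objectives")
```

With these assumed numbers, only the hourly plan satisfies both objectives; the weekly plan fails on RPO and RTO alike.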
For example, some EC2 users back up their instances by shutting them down and creating a whole image (AMI) of them. They shut them down to make sure the data on the volumes is consistent. Because this operation requires downtime, they only perform this backup once a week. Because this approach is costly, and because every image is a full image of the data, they only keep the most recent one (or maybe two). But is this good enough for their business needs? If the data kept on these instances is production business data, then losing a week of it can be devastating. Sometimes even losing one day or one hour of data can be unacceptable. If that is the case, you probably aren’t using the right tool for the job.