On the morning of August 8, Delta, the world’s second-largest carrier by revenue and fleet size, lost power at its operations center. As a result, the computers used for booking passengers and flying jets were down for nearly five hours.
Those five hours started a domino effect that forced the company to cancel 1,000 flights that day, and a similar number over the following two days. The outage cost the company $150 million and prompted angry calls and emails from many customers.
It’s an important equation: five hours of downtime = $150 million. The solution is pretty simple, though: every business needs to carefully consider and choose a reliable backup.
(Almost) Worst-Case Scenarios
There are plenty of DR (disaster recovery) scenarios. There is no telling where (or, more precisely, on what level) the problem will be: files can be corrupted or missing, a block in a volume can fail, or the whole volume can become compromised. While such examples are smaller-scale, a business must also prepare for more significant disasters, such as the failure of its entire data center or the collapse of the entire region in which the data is stored.
In AWS, each region is divided into Availability Zones (AZs) – but those can fail, too. The causes for a breakdown may be physical, such as hardware failure or a power outage. Buildings can be damaged by fire or flood, and so on. Non-physical failures may also occur; for instance, when an update is being rolled out and has serious effects on a database, or when a specific block or storage volume becomes corrupted.
The human factor is also important to take into account, as the danger from hackers is ever growing (which we’ll explore later with Code Spaces’ experience). And of course, as the saying goes, “to err is human.” People accidentally delete critical files or directories. Maintaining a backup means that organizations won’t suffer severe consequences when such incidents occur.
Backup Considerations in the Cloud
The first thing to do when developing a cloud DR strategy is to examine the business and the resources that are available (financial, personnel, and otherwise). Next, examine the SLA so that customers’ expectations are met; this may involve requirements for certain uptimes or update frequencies.
There are two key metrics to consider: the recovery time objective (RTO), which is the maximum acceptable time to resume operations; and the recovery point objective (RPO), which refers to the amount of data loss that is considered acceptable.
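As a rough illustration of how these two metrics can be checked against a backup plan, here is a minimal Python sketch. It assumes periodic backups, so the worst-case data loss is one full backup interval; the function name and the example figures are hypothetical and not taken from any particular tool.

```python
from datetime import timedelta


def meets_objectives(backup_interval, estimated_restore_time, rpo, rto):
    """Compare a backup plan against its RPO and RTO.

    Assumption: with periodic backups, a failure just before the next
    backup loses up to one full interval of data.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": estimated_restore_time <= rto,
    }


# Hypothetical plan: backups every 6 hours, restore takes about 2 hours.
result = meets_objectives(
    backup_interval=timedelta(hours=6),
    estimated_restore_time=timedelta(hours=2),
    rpo=timedelta(hours=1),   # at most 1 hour of data loss is acceptable
    rto=timedelta(hours=4),   # operations must resume within 4 hours
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this made-up case the restore is fast enough, but the backups are too infrequent for the stated RPO, so the backup interval would have to shrink to an hour or less.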
Separation = Protection
Code Spaces’ story is a cautionary tale. In June 2014, after the company had been in business for seven years, an attacker gained control of its AWS control panel and demanded money in exchange for returning it. When Code Spaces refused, the attacker began deleting the company’s files. By the time the IT team managed to regain control, it was too late: the attacker had deleted so much of the company’s data that it was forced to shut down for good.
Code Spaces had backups and a recovery plan. Its only mistake, albeit a massive one, was keeping control over both the working data and the backup data in the same account. Once the intruder had access to the account, he had access to both sets of data. In effect, the backup was useless.
The lesson to learn from this story is that data separation is a basic component of protection. Employing different accounts improves security because an intruder who takes over one account cannot also destroy the backups, which renders ransom attacks such as this one ineffective. Managing a single AWS account using AWS IAM is pretty simple. However, to keep backups in a separate account you will need to implement cross-account access.
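One possible shape for that cross-account setup is sketched below in Python, which simply builds the two IAM policy documents as dictionaries. It assumes a role that lives in a separate backup account, that the production account is allowed to assume it, and that the role may only write objects to a backup bucket. The account ID, bucket name, and function names are placeholders chosen for illustration, not details from the Code Spaces case.

```python
import json


def backup_role_trust_policy(production_account_id: str) -> dict:
    """Trust policy for a role in the BACKUP account.

    It lets principals in the production account assume the role.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{production_account_id}:root"
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


def write_only_backup_policy(bucket_name: str) -> dict:
    """Permissions policy for the same role: upload only, no delete."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# Placeholder values for illustration only.
trust = backup_role_trust_policy("111111111111")
permissions = write_only_backup_policy("example-backup-bucket")
print(json.dumps(trust, indent=2))
print(json.dumps(permissions, indent=2))
```

The point of the design is that credentials stolen from the production account can add backups but cannot remove them, because the role grants no delete permission.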
Cloudy With a Chance of Backup
A backup plan must take into consideration both cost and reliability. This often includes a need for separate locations, as the California DMV learned the hard way when its primary and secondary backup systems went offline at the same time.
There are three different backup approaches to choose from. First, there is Pilot Light, where a minimum viable stack of the environment is always running in the cloud, maintaining the core necessities, such as the databases, around which the full-scale environment can be quickly built using tools such as AWS CloudFormation. Warm Backup, the second approach, takes it up a notch: instead of a minimum viable stack, a fully operational, scaled-down one is running in the cloud at all times. The third option, and the most comprehensive, is the active/active strategy. This is a multi-site approach in which both versions (original and backup) are complete and running, one on site and one in the cloud. The replication frequency and methods are determined by the RPO and RTO.
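To make the cost-versus-recovery-time trade-off concrete, the following Python sketch picks the cheapest of the three approaches that still meets a given RTO. The recovery times and monthly costs are invented placeholders; real figures depend entirely on the environment and would have to be measured.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Strategy:
    name: str
    est_rto_minutes: int  # hypothetical estimate, not a measured value
    monthly_cost: int     # hypothetical cost in dollars


# Placeholder numbers for illustration only.
STRATEGIES = [
    Strategy("pilot light", est_rto_minutes=240, monthly_cost=500),
    Strategy("warm backup", est_rto_minutes=60, monthly_cost=2000),
    Strategy("active/active", est_rto_minutes=5, monthly_cost=8000),
]


def cheapest_meeting_rto(
    strategies: Sequence[Strategy], rto_minutes: int
) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose estimated RTO fits, if any."""
    candidates = [s for s in strategies if s.est_rto_minutes <= rto_minutes]
    return min(candidates, key=lambda s: s.monthly_cost, default=None)


choice = cheapest_meeting_rto(STRATEGIES, rto_minutes=90)
print(choice.name)  # warm backup
```

With a 90-minute RTO, pilot light is too slow under these assumed numbers and active/active is more than is needed, so warm backup wins.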
Location, Location, Location
Another important fact to remember is that cloud storage does not equal a complete and safe backup. Last June, the Australia-based AWS cloud went down and stayed down for 10 hours. This should remind everyone that any supplier of cloud services, no matter how big, is still prone to failures.
Using AWS offers another layer of protection: the multi-AZ or multi-region solution. Applications deployed on AWS gain multi-site capability through multiple Availability Zones, which are linked by inexpensive, low-latency network connectivity within the same region. The additional advantage of this setup is that both primary and secondary systems share the same provider, interface, technology and tools, so that implementation is hassle-free both contractually and for the IT staff. A multi-region solution is basically the same as multi-AZ, but across different regions.
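As one example of a multi-region approach, the sketch below copies an EBS snapshot into a second region. It assumes the data being protected lives in EBS snapshots and that the caller passes in a boto3 EC2 client created for the destination region; the helper name, region names, and snapshot ID are placeholders.

```python
def copy_snapshot_to_region(dest_ec2_client, source_region, snapshot_id):
    """Copy an EBS snapshot into the region the client is bound to.

    `dest_ec2_client` is expected to be a boto3 EC2 client for the
    DESTINATION region, e.g.
        boto3.client("ec2", region_name="us-west-2")
    Returns the ID of the new snapshot in the destination region.
    """
    response = dest_ec2_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]


# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# dest = boto3.client("ec2", region_name="us-west-2")
# new_id = copy_snapshot_to_region(dest, "us-east-1", "snap-0123456789abcdef0")
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stand-in object before pointing it at a real account.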
Automation for the People
The constant concerns about application and data migration or upgrades cannot be alleviated (at least in most cases) by traditional local backup. As mentioned, human errors coupled with a host of imaginable and unimaginable disasters mean that organizations large and small are at risk.
Automation is a much safer solution. With AWS cloud infrastructure, one can use AWS automation to streamline backup operations, and even incorporate them within the DevOps environment. For example, making the backup processes and recovery tests part of your release cycles allows for seamless updating of your secondary sites’ stacks.
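One way to fold backup checks into a release cycle is a gate that blocks a release when the latest backup is too old or a recovery test is overdue. The Python sketch below shows the idea; the function name, the thresholds, and the timestamps are hypothetical, and a real pipeline would read them from its backup tooling.

```python
from datetime import datetime, timedelta


def release_gate(now, last_backup_at, last_restore_test_at,
                 max_backup_age, max_restore_test_age):
    """Return a list of problems; an empty list means the release may go on."""
    problems = []
    if now - last_backup_at > max_backup_age:
        problems.append("latest backup is older than allowed")
    if now - last_restore_test_at > max_restore_test_age:
        problems.append("restore test is overdue")
    return problems


# Hypothetical timestamps and thresholds for illustration.
problems = release_gate(
    now=datetime(2016, 12, 1, 12, 0),
    last_backup_at=datetime(2016, 12, 1, 9, 0),    # 3 hours ago
    last_restore_test_at=datetime(2016, 10, 1),    # about two months ago
    max_backup_age=timedelta(hours=6),
    max_restore_test_age=timedelta(days=30),
)
print(problems)  # ['restore test is overdue']
```

A release script could exit with a non-zero status whenever the list is non-empty, so an untested backup stops the release instead of going unnoticed.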
This article started with numbers, so it makes sense to end with them as well. The case of Amazon.com comes to mind: on January 31, 2013, just a little less than four years ago, the retail giant suffered a 49-minute outage, which translated into a loss of over $4 million. No one is immune, and therefore backup is vital to the fabric of the business.