Moving away from traditional data centers and into the cloud has changed how we handle our infrastructure, as well as the processes around it. It has had a huge effect on disaster recovery too, which has become faster, cheaper, and generally easier to manage. In the past, businesses usually had to keep a second data center ready to take over in case a disastrous event occurred, which was both time-consuming and expensive. Now you can provision an entire new infrastructure in a secondary location (whether another region or even another account) on demand within minutes, and tear it down just as easily when it's no longer needed.
AWS Shared Responsibility Model
When talking about disaster recovery on AWS specifically, it's important to first understand the Shared Responsibility Model, which defines the roles and responsibilities of all parties (both Amazon and its customers) with regard to protecting resources within the public cloud. According to the model, Amazon secures the entire underlying infrastructure at the physical level (its state-of-the-art data centers have multi-layered protection systems that include security guards, cameras, and so on), but it is the customers' responsibility to protect their applications and the data located in the cloud itself. This means customers need to ensure they have comprehensive data protection, both by utilizing Amazon's capabilities and services (security groups, access control lists, Amazon GuardDuty, AWS WAF, etc.) and by ensuring business continuity through proper planning for actual disastrous events.
AWS Disaster Recovery Scenarios
In our previous article, we offered ten tips for a solid AWS disaster recovery plan. Still, despite the precautions we take, things can go wrong occasionally. So, let’s take a look at five disaster recovery scenarios that can happen on AWS, and a few tips for handling them.
1. Human Mistakes Leading to Accidental Resource Deletion
Termination protection is a great option for making sure your resources are safe from accidental deletion. Let's say you have an Amazon EC2 instance running a production workload that writes data to an attached secondary Amazon EBS volume. An admin working with several EC2 instances could terminate the wrong one by mistake, or accidentally delete the attached EBS volume. Even worse, the admin could terminate an entire AWS CloudFormation stack containing not only your instance, but a multitude of other resources. While instances can be reprovisioned quickly, terminating an instance will, by default, preserve only the secondary attached EBS volumes. The root volume, the one containing the image used to boot the instance, is destroyed unless the option to preserve it is specifically selected during provisioning. Termination protection can help greatly in preventing these kinds of issues, and it can be applied to both EC2 instances and CloudFormation stacks, as sketched below. But regular backup is a must for everything that is mission-critical, so make sure you have a reliable process in place.
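To make this concrete, here is a minimal boto3 sketch of the protections just described. The instance ID, root device name, and stack name are hypothetical placeholders; you would substitute your own values.

```python
import boto3

ec2 = boto3.client("ec2")
cfn = boto3.client("cloudformation")

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

# Enable termination protection: API and console terminate calls
# will fail until this attribute is switched off again.
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    DisableApiTermination={"Value": True},
)

# Keep the root EBS volume even if the instance is terminated.
# The device name is an assumption; check your instance's actual root device.
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    BlockDeviceMappings=[
        {"DeviceName": "/dev/xvda", "Ebs": {"DeleteOnTermination": False}}
    ],
)

# Protect an entire CloudFormation stack from accidental deletion.
cfn.update_termination_protection(
    StackName="production-stack",  # hypothetical stack name
    EnableTerminationProtection=True,
)
```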
2. Malicious Attacks
Network security is often overlooked on AWS. For some reason, customers assume their environment is secured out of the box, but that is not the case. For example, the default network access control lists (ACLs) allow all traffic, and it is up to you to create the rules that act as a firewall in front of your subnets. Even worse, people working in your cloud environment often create security groups that are open to the public Internet, which can be a serious vulnerability. In a world where the competition is constantly getting stronger, a malicious attack is a realistic possibility with potentially devastating consequences for your business. To prevent an attack of this sort, some strict policies must be followed: there should be no open ports other than those your applications actually need (see the sketch below); networks should be segmented and protected by services like AWS WAF; and additional application-level controls should be maintained on the instances themselves. Having a secondary account ready to take over is an option here as well. This would allow you to quickly redeploy your infrastructure using CloudFormation templates and recover your data from backups, putting you back in business with minimal downtime.
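As a minimal illustration of the "no unnecessary open ports" policy, the boto3 sketch below creates a security group that permits inbound HTTPS only. The VPC ID and group name are hypothetical; a newly created security group allows no inbound traffic until rules are added.

```python
import boto3

ec2 = boto3.client("ec2")

VPC_ID = "vpc-0123456789abcdef0"  # hypothetical VPC ID

# Create a security group; by default it allows no inbound traffic at all.
resp = ec2.create_security_group(
    GroupName="web-tier-https-only",
    Description="Allow inbound HTTPS only",
    VpcId=VPC_ID,
)
sg_id = resp["GroupId"]

# Open only the port the application actually needs (HTTPS here),
# instead of leaving the group wide open on every port.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [
                {"CidrIp": "0.0.0.0/0", "Description": "Public HTTPS"}
            ],
        }
    ],
)
```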
3. Compromised AWS Accounts
One of the worst things that can happen to a business running on AWS is a compromised account. The degree of damage can vary greatly depending on the privileges attached to that particular account: if someone gets hold of the password of a minor admin with limited access, the resulting damage will be less than if full admin, or even worse, root credentials are stolen. This has happened in the past, and companies have gone out of business as a consequence. Preparing for such a scenario is much harder than preparing for an outside attack, as everything you have built in the account is susceptible to complete loss. In the compromised account scenario, a hacker could literally delete all of your infrastructure, data, and backups, while in a malicious attack, only part of your environment would likely be affected. While the issue of a compromised account is being resolved (Amazon support must be contacted in most cases), you could rely on a secondary account to take over. To prepare for such an event properly, all your backups should already be safely stored away from your primary AWS account, as sketched below. Alternatively, you could architect your cloud infrastructure to span multiple AWS accounts (each dedicated to a specific environment or team), which would greatly reduce the attack surface. A compromised account is something you obviously want to avoid, so make sure you follow the AWS IAM best practices, and pay special attention to the privileges granted to your users.
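One way to keep backups out of the primary account's reach is to share EBS snapshots with a secondary account and copy them there, so the copies are owned by the secondary account. A minimal boto3 sketch follows; the snapshot ID, account ID, and regions are hypothetical, and sharing this way assumes an unencrypted snapshot (encrypted snapshots additionally require sharing the KMS key).

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

SNAPSHOT_ID = "snap-0123456789abcdef0"   # hypothetical snapshot ID
RECOVERY_ACCOUNT = "111122223333"        # hypothetical secondary account ID

# In the primary account: grant the secondary account
# permission to read this snapshot.
ec2.modify_snapshot_attribute(
    SnapshotId=SNAPSHOT_ID,
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=[RECOVERY_ACCOUNT],
)

# Then, running under the secondary account's credentials, copy the shared
# snapshot so the copy is owned there and survives even if the primary
# account is compromised and wiped.
recovery_ec2 = boto3.client("ec2", region_name="us-west-2")
recovery_ec2.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=SNAPSHOT_ID,
    Description="Cross-account DR copy",
)
```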
4. NoSQL Recovery
While all of the previous scenarios can be prevented by properly protecting your cloud environment, some cases are completely out of your hands. For example, in September 2015 a service event (initially caused by a network outage) ultimately left Amazon DynamoDB overloaded. As a consequence, customers in the US-East-1 region experienced elevated error rates, making the service unusable for many. Even though the issue was resolved in a couple of hours (it persisted a bit longer for some users), many customers depended heavily on DynamoDB, so even a short outage was very harmful. DynamoDB replicates data across three Availability Zones by default, but this situation showed the obvious need to have your DynamoDB data backed up and ready to be served from another region. You can also use Global Tables, introduced in late 2017, which give you automated replication of your tables across multiple regions, along with full support for multi-master writes. A sketch of both approaches follows.
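Here is a minimal boto3 sketch of an on-demand DynamoDB backup plus the 2017-era Global Tables call. The table and backup names are hypothetical, and the create_global_table API assumes a table with the same name already exists in every listed region with streams enabled (NEW_AND_OLD_IMAGES).

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Take an on-demand backup of an existing table.
dynamodb.create_backup(
    TableName="orders",               # hypothetical table name
    BackupName="orders-pre-dr-drill", # hypothetical backup name
)

# Link per-region replicas into a global table (2017-era API).
# The 'orders' table must already exist in each listed region
# with DynamoDB Streams enabled before this call succeeds.
dynamodb.create_global_table(
    GlobalTableName="orders",
    ReplicationGroup=[
        {"RegionName": "us-east-1"},
        {"RegionName": "eu-west-1"},
    ],
)
```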
5. AWS Regional Outages
Regional outages on AWS are very rare, but they have happened in the past. When they do occur, they usually affect more than one service, causing huge distress to AWS customers. Here are a few examples:

In April 2011, an outage involving EC2 instances and EBS volumes, which affected Amazon RDS as well, lasted more than 48 hours; many well-known websites (like Quora, Reddit, and Foursquare) experienced issues. In August of that year, another outage occurred. Even though it was a short one, it caused Netflix to experience downtime, bringing a lot of negative attention to AWS. Netflix customers were unlucky again in 2012, when their service was unavailable on Christmas Eve due to yet another regional AWS outage.

In June 2016, severe weather caused a power outage in multiple facilities within the Sydney region, resulting in roughly ten hours of downtime. Even though Amazon builds redundancy into everything, including the power supply, it wasn't enough to prevent the problem, and the systems were upgraded afterwards as a consequence.

In February 2017, AWS experienced its biggest Amazon S3 outage to date. After a human error took some servers offline, Amazon S3 went dark for some five hours, with residual issues even after that. The outage was so severe that the AWS service health dashboard itself couldn't display warnings. Other services were affected in some capacity as well, so users experienced issues with Elastic Load Balancing, Amazon RDS, AWS Lambda, Amazon EMR, Amazon EFS, Amazon SES, and others.

These examples show why a region is still a single point of failure, even with multiple Availability Zones and redundant networking, power, and other systems in place. The best (and probably only) proper way to be safe from disastrous events is to avoid relying on a single region. This is why backing up your data to another region, or in some cases another AWS account, should always be part of your disaster recovery plan; a minimal example of cross-region replication follows.
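As one way to keep S3 data available outside a single region, the boto3 sketch below enables cross-region replication on a bucket. All names and ARNs are hypothetical, and it assumes both buckets already exist with versioning enabled and that the referenced IAM role grants S3 the necessary replication permissions.

```python
import boto3

s3 = boto3.client("s3")

# Replicate every object from the source bucket to a bucket in another
# region. Versioning must already be enabled on both buckets.
s3.put_bucket_replication(
    Bucket="my-app-data",  # hypothetical source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
        "Rules": [
            {
                "ID": "dr-copy",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter: replicate all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::my-app-data-dr"},
            }
        ],
    },
)
```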
Introducing N2WS Backup & Recovery
As we mentioned, completely avoiding disasters in today's IT world is almost impossible, no matter how much you prepare. Eventually, something will go wrong, and when it does, the effectiveness of your disaster recovery plan will be put to the test. AWS offers a multitude of resources and solutions to help you with disaster recovery, but you should also keep an eye on third-party solutions, as there are many tools out there worth looking into. N2WS Backup & Recovery allows you to automate the backup and recovery process for many AWS services (EC2 instances, EBS volumes, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and others) by extending and enhancing native snapshot capabilities. It also offers application-consistent backups and allows you to quickly recover your data to another region, or even another account, when necessary. The tool provides options for automated backup scheduling, custom retention periods, failure handling, and more.