Business continuity is the process of minimizing threats to a company through the prevention of incidents and the development of recovery scenarios. Natural disasters, disgruntled employees, malicious hacks, and innocent human errors are just a few of the elements that can affect critical business processes and put your company in danger. While we all want to avoid threatening situations, in reality, they are not always preventable. This makes recovery critical.
Disaster recovery planning is often discussed, but it is not always implemented properly. There are articles about recovery best practices all over the internet, and there are just as many articles describing how badly companies were hit when something actually went wrong. Weakly implemented backup plans and a lack of disaster recovery drills continue to be commonplace. A lack of time and money is often cited as the excuse, since it can be hard to justify big investments in business continuity. Yet not investing enough can sometimes be as bad as not investing at all, and many companies still gamble on the bare minimum and hope for the best.
This article and the second part of the series provide an overview of four AWS disaster recovery scenarios. The first two, the simple backup and restore scenario and the use of a pilot light, are covered here in Part 1. Part 2 will examine the warm standby solution and the multisite solution. Although we’ll be looking at these examples in the context of AWS, the principles behind them can be applied elsewhere. In our next article, we’ll also examine how our tool, N2WS Backup and Recovery, can help you manage these scenarios.
Scenario 1: Cloud Backup and Restore
Backup and restore is the simplest of the four AWS disaster recovery scenarios. Assuming you back up your important data regularly, you can enable your business environment to continue functioning by restoring it after a disaster occurs.
While this scenario is the one most frequently used (its requirements are minimal), it involves the most recovery time. There is always a trade-off between the up-front investment needed to implement a scenario and the time it takes to recover after an incident has occurred. The less time you spend preparing for a disaster, the more time it takes to bounce back from it. For most companies, the appropriate balance between these two factors is determined by the Recovery Time Objective (RTO), the maximum acceptable downtime, and the Recovery Point Objective (RPO), the maximum acceptable window of data loss. RTOs and RPOs are often defined in an organization’s contractual obligations to its clients. Identifying your RTO and RPO is a key step in designing the right DR solution, so make sure you understand what your company is required to provide before choosing a disaster recovery scenario and implementing disaster recovery best practices.
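To make the RTO and RPO trade-off concrete, here is a minimal sketch that checks a backup-and-restore setup against both objectives. It assumes that worst-case data loss equals the interval between backups and that downtime can be approximated by the restore time measured in a drill. The function name and all of the numbers are hypothetical.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a backup-and-restore setup.

    Assumption: the worst-case data loss is one full backup interval,
    and the worst-case downtime is the measured restore duration.
    """
    return backup_interval <= rpo, restore_duration <= rto


# Hypothetical numbers: nightly backups, a 6-hour restore drill,
# and objectives of RPO = 4 hours and RTO = 8 hours.
rpo_ok, rto_ok = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=6),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(rpo_ok, rto_ok)  # False True
```

In this example the restore is fast enough, but nightly backups cannot satisfy a 4-hour RPO, so either the backups need to run more often or a different scenario is required.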
Your backup and restore setup will vary depending on the AWS services you use. There are many use cases for copying snapshots to Amazon S3 as well as for using Amazon S3 storage tools. You can take regular snapshots of your EBS volumes (which AWS stores in Amazon S3), or you can upload the data to this object storage service yourself. (More efficiently, you can copy EBS snapshots to an Amazon S3 bucket using N2WS.) Implement proper lifecycle policies to control the cost of storing these backups. Amazon S3 Glacier, which has many use cases of its own, may also save money if your RTO and RPO allow for it, although it will increase your recovery time.
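As an illustration of such a lifecycle policy, the sketch below builds a configuration that moves backup objects to the Glacier storage class after a number of days and deletes them later. The dictionary follows the shape accepted by the S3 lifecycle configuration API. The helper name, the prefix, and the day counts are assumptions chosen for the example.

```python
def backup_lifecycle_config(prefix, glacier_after_days, expire_after_days):
    """Build an S3 lifecycle configuration for backup objects.

    Objects under `prefix` transition to the GLACIER storage class after
    `glacier_after_days` and are deleted after `expire_after_days`.
    """
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-" + (prefix.strip("/") or "all"),
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Hypothetical policy: archive after 30 days, delete after one year.
config = backup_lifecycle_config("backups/", 30, 365)
print(config["Rules"][0]["ID"])  # archive-backups
```

With boto3, a configuration like this could be applied with `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`. Note that it only governs objects in a bucket you manage yourself; the retention of native EBS snapshots is handled separately.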
When the inevitable disaster does occur, the recovery process depends on many factors, including the severity of the event. In the simplest case, you can restore your data from the Amazon S3 bucket in the same region and be done with the process. It might entail provisioning a new instance from the snapshot or copying some files over, but no major work should be necessary.
On the other hand, a full regional outage might leave you with fewer options. You might have to copy the snapshot to another AWS region and recreate your workloads there. More work might be needed since running instances in a non-primary region requires networking, permissions, and everything else to be in place. This is why you should consider which regions you want to use in case of an emergency. Doing this additional preparation work can reduce your downtime.
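The cross-region copy described above might look like the following sketch. It assumes a boto3 EC2 client that was created in the destination region, since `copy_snapshot` is called against the region that receives the copy. The function name and the example identifiers are hypothetical.

```python
def copy_snapshot_to_region(dest_ec2, snapshot_id, source_region):
    """Copy an EBS snapshot into the region that `dest_ec2` belongs to.

    `dest_ec2` is assumed to be a boto3 EC2 client created in the
    destination (recovery) region. Returns the ID of the new snapshot.
    """
    response = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]
```

A call could then look like `copy_snapshot_to_region(boto3.client("ec2", region_name="us-west-2"), "snap-0123456789abcdef0", "us-east-1")`. Running copies like this on a schedule, rather than waiting for an outage, means the data is already in the recovery region when you need it.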
While the backup and restore scenario is the slowest of the AWS disaster recovery scenarios, not everyone needs an “always on” setup. As a result, backup and restore is a completely viable disaster recovery option for many companies.
Scenario 2: Pilot Light
The idea behind the pilot light scenario is to maintain a small, cheap, passive environment (the metaphor of a small flame is used here), often in another AWS region, that can be rapidly expanded to a fully functional active environment (igniting the flame) if the primary region fails. Usually, only the core services are actually running in this scenario, so your pilot light will frequently include a database (which usually takes a long time to provision) and perhaps a few instances. Sometimes, only the Auto Scaling group is in place, with the desired instance count set to 0.
This scenario allows a much faster recovery than the backup and restore scenario, but it requires much more up-front work as well as a greater financial investment. Preparing the pilot light environment is no small task. Basically, you need to replicate your existing infrastructure. Although the networking and user access only need to be set up once, the databases might need to be replicated often. Also, your EC2 instances must be provisioned from the latest AMI you use, along with all of the patches and updates. There should be no difference whatsoever between the instances in your active and passive environments. Automation can speed up the necessary processes here.
In a pilot light scenario, the recovery phase is simpler than it is in a backup and restore scenario, since you’re already running most of what you need. The first step is to scale up: increase the desired capacity of your Auto Scaling groups so that more EC2 instances launch, and do the same for everything else that was running at minimal capacity. While horizontal scaling is almost always preferred, you can also scale vertically by moving to larger instance sizes. The next step is to bring up all of the other supporting systems and services that weren’t already set up. These are usually non-core system services that would incur unnecessary costs if run in the pilot light environment. Ideally, you’ll bring these supporting systems and services into the environment in a pre-planned, automated fashion. If up-to-date files or other resources are required, any data that is not already replicated must be copied from your primary region into the pilot light environment. Finally, you’ll most likely need to switch your DNS settings so that your records point to your new primary region, the pilot light environment.
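Two of these recovery steps, scaling up and switching DNS, can be sketched as follows. The sketch assumes boto3 Auto Scaling and Route 53 clients are passed in, and that the application is reached through a CNAME record. The function names, group name, zone ID, and hostnames are all hypothetical.

```python
def set_group_size(autoscaling, group_name, min_size, desired, max_size):
    """Resize an Auto Scaling group, e.g. from 0 instances to full capacity."""
    if not min_size <= desired <= max_size:
        raise ValueError("need min_size <= desired <= max_size")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=min_size,
        DesiredCapacity=desired,
        MaxSize=max_size,
    )


def point_dns_at(route53, hosted_zone_id, record_name, target, ttl=60):
    """Upsert a CNAME record so traffic flows to the recovery environment."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Fail over to the pilot light environment",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": record_name,
                        "Type": "CNAME",
                        "TTL": ttl,
                        "ResourceRecords": [{"Value": target}],
                    },
                }
            ],
        },
    )
```

During a failover you might call `set_group_size(asg, "web-dr", 2, 4, 8)` followed by `point_dns_at(r53, zone_id, "app.example.com.", dr_load_balancer_dns)`. Once the primary region is healthy again, the same helper with `set_group_size(asg, "web-dr", 0, 0, 8)` returns the group to its idle state. A short TTL is used here so that clients pick up the DNS change quickly; that choice is an assumption, not a requirement.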
After the disaster event has passed and all subsequent issues have been resolved, you can return to your primary region and scale your pilot light back down to a small flame that is ready to be used again when needed.
Last Words on Backup and Restore and Pilot Light
In the first part of this two-part article, we’ve examined the first two AWS disaster recovery scenarios: backup and restore, and pilot light. These are the two least expensive options available, and they are also the two with the slowest recovery times. Still, for the majority of users, they are more than sufficient, since they can get an environment up and running again after a disaster in the primary region and can be quite cost-effective. If these two options do not meet your needs, consider the warm standby and multisite scenarios covered in Part 2.
Having laid the groundwork, we’ll use our next blog post to look at the warm standby and multisite scenarios, which are more robust but much pricier options. We’ll also look a bit deeper at how N2WS Backup and Recovery fits into these scenarios, and why your company is likely to benefit from using it.