Planning for A Rainy Day: Key Considerations for EC2 Instance Recovery


Your AWS applications run on EC2 instances in the form of Linux or Windows VMs. These instances are associated with resources such as storage, CPU, and networking, similar to physical computers. And like physical machines, EC2 instances can fail. This can occur as a result of soft or hard crashes, network connectivity problems, and more. When a server fails, it needs to be brought back to life and recovered in order to resume operation. Ideally, the server can be recovered with its data intact so that its state and connectivity can be fully restored.

This is the case in a variety of situations, whether a physical server failed, a virtual machine in a customer’s data center crashed, or a server running on an EC2 instance in the AWS cloud stopped operating.

An IT organization that runs its workloads on EC2 servers in the AWS cloud needs to establish cloud disaster recovery procedures for its compute environment. Clearly, considerations for planning instance recovery in the AWS cloud are different from recovery plans for on-premises servers. When planning for EC2 recovery, you must take into account the unique AWS environment and the tools and techniques that are available on the platform. To a client application, an EC2 instance is simply a Linux or Windows server, but there are different ways to configure EC2 instances that will impact the planning for recovery. These elements need to be taken into consideration as well. In this article, we discuss some of the most important considerations that should go into the planning of EC2 recovery.

Windows vs. Linux

There are differentiating factors within EC2 instance recovery depending on the OS of the guest VMs. We discuss varying AMI virtualization types in a subsequent section. It’s important to note that AWS has some restrictions and limitations on Windows guest VMs. For example, the way that you launch an AMI based on an existing snapshot in Windows is different from how you would do it in Linux. In other cases, you may need to periodically create an AMI from your running instance and use it to re-launch your failed EC2 Windows instance. There are additional Windows-specific concerns as well. For detailed insights on the AWS Windows AMI launching peculiarities, read this article by Uri Wolloch, CEO of N2WS.
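As a rough illustration of the periodic AMI approach mentioned above, here is a minimal Python sketch. It assumes a client object that behaves like boto3.client("ec2") and is passed in by the caller. The function name, the name prefix, and the choice of NoReboot=True are our own assumptions for the example, not a recommended configuration.

```python
from datetime import datetime, timezone


def create_recovery_ami(ec2, instance_id, prefix="recovery"):
    """Create an AMI from a running instance and return the new image ID.

    `ec2` is assumed to behave like boto3.client("ec2").
    `prefix` is a hypothetical naming convention used only in this sketch.
    """
    # Timestamp the name so that repeated runs do not collide.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    response = ec2.create_image(
        InstanceId=instance_id,
        Name=f"{prefix}-{instance_id}-{stamp}",
        Description=f"Scheduled recovery image of {instance_id}",
        # Assumption: skip the reboot to avoid downtime. File system
        # consistency of the resulting image is then not guaranteed.
        NoReboot=True,
    )
    return response["ImageId"]
```

A scheduler of your choice could call this function at an interval that matches how much data loss you can tolerate.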

High Availability Architecture

The simplest recovery for EC2 instances can be achieved with a high availability architecture. EC2 instances can be deployed as failover pairs that run in different availability zones (AZs) and share a single Elastic IP (EIP). Upon failure of the primary instance, the EIP is remapped to the failover instance, clients are transferred to it, and processing can continue with little or no interruption. High availability comes at a cost. First, a multi-AZ deployment is more expensive than a single-AZ deployment. There may also be performance issues, as well as rare inconsistencies due to brief delays in synchronization between the members of a failover pair. To optimize your overall cost, you may want to carefully select the part of your workload that is most sensitive to downtime and deploy only that part in a high availability architecture. The same considerations also apply when failing over across AWS regions or even across AWS accounts (to build an isolated and secure second site).
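The EIP handover in a failover pair can be sketched in a few lines of Python. This is only an illustration: it assumes a client that behaves like boto3.client("ec2"), an Elastic IP allocated in a VPC, and a standby instance that is already running. The function name is hypothetical.

```python
def fail_over_elastic_ip(ec2, allocation_id, standby_instance_id):
    """Point an Elastic IP at the standby instance and return the association ID.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    response = ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        # Allow the address to move even though it is still associated
        # with the failed primary instance.
        AllowReassociation=True,
    )
    return response["AssociationId"]
```

Detecting that the primary has actually failed is a separate problem and is deliberately left out of this sketch.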

Protecting the system’s “hard drive”

An EC2 server needs a hard drive just like a physical server. AWS provides two types of storage that can serve as the root device. One option, known as ‘instance store’, is block storage allocated on physical disks attached to the host servers in AWS data centers. It’s also the less expensive of the two. However, instance store has a limited life cycle. When a failure occurs, all changes that were made to the data in the instance store disappear and the instance store is reinitialized. If you want your system to be preserved when the instance is stopped or terminated, or when its underlying host fails, you need to use the second system storage option: EBS volumes used as raw block devices. This method offers several means of protection and failure survivability.

The EBS volume attached to an EC2 instance, serving as the root device, is treated identically to any other EBS volume. You may use EBS snapshots in order to protect the storage. Recovering a failed EC2 instance can be done by using the EBS volume as the root device, exactly as a physical server is rebooted from its own attached hard drive. Another benefit is that Amazon now supports AMI creation from encrypted EBS volumes, enhancing the security of the customer’s data. It is also possible to use EBS for the system’s persistent data while temporary storage uses the less expensive instance store, which is unrecoverable.
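To make the snapshot idea concrete, the following Python sketch takes a snapshot of a root volume and later creates a fresh volume from it. It assumes a client that behaves like boto3.client("ec2"). Both function names are hypothetical, and steps such as waiting for the snapshot to complete and attaching the new volume are omitted.

```python
def snapshot_root_volume(ec2, volume_id):
    """Start a snapshot of an EBS volume and return the snapshot ID.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Recovery snapshot of {volume_id}",
    )
    return response["SnapshotId"]


def restore_volume(ec2, snapshot_id, availability_zone):
    """Create a new EBS volume from a snapshot and return the volume ID.

    The volume is created in the given availability zone, which must match
    the zone of the instance that will use it.
    """
    response = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    return response["VolumeId"]
```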

Instance Virtualization (PV vs. HVM)

For server virtualization, AWS uses the Xen hypervisor, which features two virtualization types: paravirtual (PV) and Hardware Virtual Machine (HVM). In general, Amazon prefers that you use HVM, which provides completely transparent OS virtualization. Historically, AWS provided a partial virtualization option for Linux named ‘paravirtualization’ that allowed guest servers to directly access resources efficiently, bypassing the hypervisor. However, modern HVM virtualization addresses the earlier inefficiency issues and is the preferred option for AMI configuration. Linux PV-based instances need a region-specific kernel object for each instance. Consider a scenario where you want to recover or build an instance in some other AWS region. In that scenario, you need to find a matching kernel, which can be tedious and complex. Note that for Windows AMIs, this issue is irrelevant since the PV virtualization type is not available. Learn more about EC2 Virtualization Options and the AWS Move From PV To HVM.
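Before planning a cross-region recovery, it may help to know which of your images are still PV. The Python sketch below is one way to check, assuming a client that behaves like boto3.client("ec2"). The function name is hypothetical.

```python
def paravirtual_image_ids(ec2, image_ids):
    """Return the IDs of the given AMIs whose virtualization type is PV.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    response = ec2.describe_images(ImageIds=list(image_ids))
    return [
        image["ImageId"]
        for image in response["Images"]
        # Each image reports either "hvm" or "paravirtual".
        if image.get("VirtualizationType") == "paravirtual"
    ]
```

Any image this returns would need a matching kernel in the target region, so it is a candidate for conversion to HVM first.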

Additional points to keep in mind

As with any plan designed for the recoverability of running systems, the business’s tolerance for data loss and downtime is a major factor. For example, a large retail website may suffer intolerable revenue loss from just a few seconds of downtime, while a research institution may tolerate longer outages. Quantifying these tolerances is an important step in the planning of EC2 recovery methods.
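One common way to quantify these tolerances is as a recovery point objective (RPO, the maximum acceptable data loss measured in time) and a recovery time objective (RTO, the maximum acceptable downtime). The Python sketch below checks a backup schedule against such targets. The class, the field names, and the numbers are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_min: float  # minutes between consecutive snapshots
    restore_time_min: float     # measured minutes to relaunch and verify


def meets_targets(plan, rpo_min, rto_min):
    """Return True if the plan satisfies both the RPO and the RTO."""
    # Worst case: the failure happens just before the next snapshot,
    # so everything written since the previous one is lost.
    worst_case_loss_min = plan.backup_interval_min
    return worst_case_loss_min <= rpo_min and plan.restore_time_min <= rto_min


# Hourly snapshots with a 20 minute restore, checked against two sets of targets.
hourly = RecoveryPlan(backup_interval_min=60, restore_time_min=20)
print(meets_targets(hourly, rpo_min=240, rto_min=30))  # lenient targets
print(meets_targets(hourly, rpo_min=15, rto_min=30))   # strict RPO
```

With these made-up figures, the hourly schedule satisfies the lenient targets but not the strict RPO, which would call for more frequent snapshots.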

How monitoring and troubleshooting are done is also important. AWS provides tools, such as CloudWatch Alarms and Instance Status Checks, that will notify you when problems occur so that you (or your automated tools) can take remedial action. A problem can be fixed by an operator, by automated scripts launched when certain events occur, or by third-party tools. In addition, AWS best practices strongly recommend that failure and recovery scenarios be rehearsed and that recovery drills be held regularly. Conducting these drills is something that you should seriously consider in your planning.
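As an example of an automated remedial action, a CloudWatch alarm on the system status check can trigger the built-in EC2 recover action. The Python sketch below only builds the alarm parameters; you would pass them to put_metric_alarm on a client that behaves like boto3.client("cloudwatch"). The function name, the alarm name, and the period and threshold values are assumptions chosen for illustration.

```python
def recover_alarm_params(instance_id, region):
    """Build put_metric_alarm parameters for an auto-recover alarm."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        # Assumption: alarm after two consecutive failed one-minute checks.
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action that asks EC2 to recover the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }
```

Usage would look like cloudwatch.put_metric_alarm(**recover_alarm_params("i-0abc", "us-east-1")), where the instance ID and region are placeholders.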

Final Note

As we have discussed, planning your EC2 deployment is key. We recommend the practices listed above so that when a failure occurs, your services will be up and running as soon as possible, subject to your specific demands and within your survivability tolerance. Launching an application stack after a failure can be done by cloud users, either manually or automatically. However, for many of us, the tasks of planning, deploying, and monitoring the appropriate solution are daunting. This is the reason many users consider third-party solutions to manage these complexities.
