The goal of disaster recovery (DR) is to bring a system back to life after a data loss event (“the disaster”). Organizations deploy disaster recovery facilities to ensure business continuity for their on-premises and cloud-based IT.
DR is an essential survivability function, not a nice-to-have luxury. It can be implemented in two ways: hot standby failover, or backup and recovery. While the former is an excellent (although very costly) solution, it is insufficient for many business needs. In this article, we explain why, regardless of whether your datacenter or public cloud IT is protected by failover-based DR, you also need to maintain a quality backup. In particular, we provide examples relating to Amazon Web Services.
In the public cloud in particular, the concepts and tools that let operators treat infrastructure as code make DR logic easier to implement. For example, with AWS tools such as CloudFormation, it is easy to implement the replication and recovery of complete application stacks. Yet however advanced the logic activated at recovery time, the result can only be as good as the backup it works from. You can never “fix” bad backup data, but you can always quickly fix insufficient logic within your backup and recovery processes.
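To make the “recovery as code” idea concrete, here is a minimal sketch that replays an ordered DR plan. The step names and the `execute` callback are hypothetical stand-ins for real API calls such as CloudFormation stack creation; they are not part of any AWS SDK.

```python
def run_recovery(plan, execute):
    """Replay an ordered DR plan.

    `plan` is a list of (action, params) pairs; `execute` is whatever
    performs each action (e.g. a wrapper around an AWS API client).
    Stops at the first failure so a half-built stack is never
    reported as recovered.
    """
    done = []
    for action, params in plan:
        execute(action, params)  # expected to raise on failure
        done.append(action)
    return done
```

Keeping the plan as plain data, separate from the executor, is what makes the recovery logic itself testable and versionable alongside the rest of the infrastructure code.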
Furthermore, when the need arises to restore a system state (or parts of it) to a selected past point in time, cloud disaster recovery is not the solution. Only periodic backup that preserves a consistent system timeline can help. Disastrous data loss is a rare (and unfortunate) occurrence; rollbacks to a particular point in time due to software defects, human error, or other local conditions are much more frequent. Disaster recovery typically isn’t designed to handle such situations – it is simply not enough.
We will give concrete meaning to the concept of ‘quality data backup’, a fundamental condition for disaster recoverability as well as selective point-in-time restoration, by examining the three factors that determine the quality of backup data: application awareness, granularity, and currency (up-to-dateness).
Application-Aware Backup
The most important property of a quality backup is its application awareness. An application-aware backup is sensitive to the momentary consistency and integrity of the data with respect to the applications that produce or consume it. Simply taking a crash-consistent backup copy, such as an AWS EBS volume snapshot, is not enough. The same argument holds for failover DR: when the outage occurs, the system may be in an inconsistent state, and failing over to a mirror site and bringing the processes back up (VM recovery, for example) may end in a broken, corrupt state. An application-aware backup avoids this by ensuring that a recent copy of the system state can be recovered consistently with respect to application semantics, underlying storage, and VM state.
An application-aware backup ensures that the applications accessing the data store are brought to a state of quiescence before the snapshot is taken, ensuring application recoverability. Furthermore, a given data store may be accessed by multiple users and multiple applications; the backup must be taken with the whole system quiescent, something a simple snapshot can never achieve because of its narrow view. Snapshots are not sensitive to application states, much less to system states.
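The quiesce-snapshot-resume sequence can be sketched as follows. The `quiesce()` and `resume()` hooks are hypothetical; a real implementation would map them to application-specific actions (flushing buffers, holding new writes) and pass the actual snapshot API call as `take_snapshot`.

```python
from contextlib import contextmanager

@contextmanager
def quiesced(apps):
    """Bring every application to quiescence; always resume afterwards."""
    for app in apps:
        app.quiesce()  # hypothetical hook: flush buffers, pause writes
    try:
        yield
    finally:
        # Resume in reverse order, even if the snapshot failed.
        for app in reversed(apps):
            app.resume()

def application_aware_snapshot(apps, take_snapshot):
    """Take a snapshot only while the whole application set is quiescent."""
    with quiesced(apps):
        return take_snapshot()
```

Putting the resume step in a `finally` block guarantees the applications are released even when the snapshot itself raises an error.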
Granular Data Recovery
A disaster recovery facility is typically based on a standby mirror and failover: when the primary site experiences an outage, client connections are transferred to a secondary site where standby mirrors are ready to continue business operations. However, not every data loss entails a full system failure. Restoring a complete system stack is a key advantage of the cloud, but a granular recovery system allows operators to selectively restore just the data they lost, without compromising application integrity.
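As an illustration (the dictionary-shaped catalog is an assumption for the sketch, not a real backup format), granular recovery amounts to pulling only the requested objects out of a point-in-time copy while leaving the rest of the live system untouched:

```python
def granular_restore(backup_catalog, wanted_keys):
    """Return only the requested objects from a point-in-time backup.

    `backup_catalog` maps object names to their backed-up contents;
    `wanted_keys` lists just the objects the operator lost.
    """
    missing = [k for k in wanted_keys if k not in backup_catalog]
    if missing:
        # Fail loudly rather than silently restoring a partial set.
        raise KeyError(f"not present in backup: {missing}")
    return {k: backup_catalog[k] for k in wanted_keys}
```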
Point-in-Time Rollback
Cloud disaster recovery is important, but cloud-provider-originated disasters are rare. More commonly, a need arises to roll back a production system (or a portion thereof) to a well-known consistent point in time. Disaster recovery by itself cannot address this need.
Point-in-time recovery becomes necessary as a result of a myriad of causes: software defects that corrupt data, malicious actions that destroy or corrupt the running system, human errors in data entry, and many others. All production IT environments are vulnerable to such occurrences, and only a quality backup that maintains a history of consistent point-in-time copies can mitigate these vulnerabilities.
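A point-in-time rollback needs to pick the newest consistent copy that is not newer than the requested moment. A minimal sketch, assuming the backup history is recorded as `(timestamp, backup_id)` pairs:

```python
def select_restore_point(backups, target):
    """Pick the newest backup taken at or before `target`.

    `backups` is an iterable of (timestamp, backup_id) pairs, each
    representing a consistent point-in-time copy.
    """
    eligible = [(ts, bid) for ts, bid in backups if ts <= target]
    if not eligible:
        raise LookupError("no backup precedes the requested point in time")
    # Tuples compare by timestamp first, so max() yields the newest copy.
    return max(eligible)[1]
```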
RPO in Line with the Business
A quality backup should guarantee that the interval between the age of the backup data and the time of data loss (or the desired point-in-time recovery) is at or below the threshold representing the IT organization’s data loss tolerance. This parameter is called the Recovery Point Objective (RPO), and its value should be as low as the existing SLA demands. The RPO of a mirrored failover solution is clearly low, but maintaining geographically dispersed mirror sites is expensive, and providers often factor the mirroring cost into the total solution, which increases its TCO.
Customers need to carefully assess the tradeoff between their business need for a minimal RPO and the cost of achieving it. The RPO should also be predictable: knowing the time interval between the last consistent backup and the moment of data loss is an important survivability parameter.
As mentioned previously, one can never compensate for high RPO values by deploying a superb recovery system. The time gap is already “burned into” your backup data. It is simply too late.
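The RPO check itself is simple arithmetic: compare the actual data-loss window against the threshold the SLA allows. A sketch, with hypothetical parameter names:

```python
def rpo_check(last_backup_time, loss_time, rpo_limit):
    """Return the actual data-loss window and whether it violates the RPO.

    `rpo_limit` is the maximum tolerable gap under the SLA (a timedelta);
    `last_backup_time` and `loss_time` are datetimes.
    """
    actual_window = loss_time - last_backup_time
    return actual_window, actual_window > rpo_limit
```

Note that the check can only report a violation after the fact; the only way to shrink `actual_window` is to back up more frequently before the loss occurs.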
We have described the essential characteristics of a quality backup and argued that a viable disaster recovery (which is, after all, the whole point of backup!) directly and closely depends on it. We also demonstrated that a disaster recovery system is insufficient by itself since it does not provide selective point-in-time system restoration.
As long as your backed-up data is application-consistent, sufficiently granular, and sufficiently up to date, you have options. If your data does not meet the requirements for quality backup, your disaster recovery system cannot fix it. It is forever broken.