As you may know, EBS snapshots are crash-consistent by nature. In practice, this means that recovering a server (or instance) from EBS snapshots, unless application quiescence was used during backup, is comparable to booting a machine after a system crash or power outage. In most cases the machine will boot with no serious problems. You can read more on crash consistency in an earlier post on this subject.
Assuming this server (or instance) was actively processing data when the snapshots were taken, several potential problems may occur: vital data may be cached in memory and not yet written to disk, files may be open and in the middle of a write operation, and transactions may be open and incomplete. Even in these scenarios, in most cases the recovery process will result in a functional application and not a corrupted one. This is because modern file systems and applications have built-in mechanisms to deal with these kinds of scenarios.
Databases implement transaction logs (also called write-ahead logs, redo logs or journals, depending on the product). These are files that record every update made to the database. When the system recovers from a crash, the database inspects the log and either completes or rolls back open transactions to reach the most recent consistent state. Virtually all modern databases implement transaction logs: Oracle, SQL Server, MySQL, MongoDB and others.
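To make the recovery idea concrete, here is a minimal sketch in Python. It is not how any particular database implements its log: the entry format (`begin`, `set`, `commit` tuples) and the `recover` function are hypothetical, invented for illustration. It shows only the core rule that updates from committed transactions are replayed, while updates from transactions still open at the time of the crash or snapshot are dropped.

```python
def recover(log):
    """Rebuild state from a log, applying only committed transactions.

    Hypothetical entry format, for illustration only:
      ("begin", txid), ("set", txid, key, value), ("commit", txid)
    """
    # First pass: find every transaction that reached its commit record.
    committed = {entry[1] for entry in log if entry[0] == "commit"}

    # Second pass: replay updates in order, skipping uncommitted work.
    state = {}
    for entry in log:
        if entry[0] == "set" and entry[1] in committed:
            _, _txid, key, value = entry
            state[key] = value
    return state


log = [
    ("begin", 1),
    ("set", 1, "balance", 100),
    ("commit", 1),
    ("begin", 2),
    ("set", 2, "balance", 40),
    # The snapshot was taken here: transaction 2 never committed.
]
print(recover(log))  # {'balance': 100}
```

Transaction 2 was still open when the snapshot was taken, so its update is ignored and the recovered state reflects only transaction 1.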
Modern file systems have similar mechanisms to ensure that files and the file system structure stay intact after a crash or outage. In file systems this type of tracking is called journaling, but the mechanism is similar and the purpose is the same: to make sure the file system remains consistent. Modern file systems such as NTFS, Ext3 and Ext4 all implement journaling.
Are built-in logging & journaling mechanisms bulletproof?
Suppose a crash, or the point in time of a snapshot, occurs during a write operation to such a log or journal. Could the log or journal itself become corrupted? The answer is not absolute: write operations to these logs are typically small, so the chance of a logging or journaling operation being caught half-done is relatively low. But consistency is not 100% guaranteed. To minimize this risk, various mechanisms and algorithms have been developed. A common one is to compute a checksum for every entry in the log or journal. After a crash, the recovery process verifies the checksums and discards any entry with a bad checksum.
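The checksum idea can be sketched in a few lines of Python. The on-disk layout here (a length and a CRC32 in front of each payload) and the function names are assumptions made for this example, not the format of any real journal or database log. As a simplification, this sketch stops at the first entry that is truncated or fails its checksum, treating it and anything after it as a torn tail.

```python
import struct
import zlib

# Assumed entry layout: 4-byte payload length, 4-byte CRC32, then the payload.
HEADER = struct.Struct("<II")


def encode_entry(payload: bytes) -> bytes:
    """Prefix a payload with its length and CRC32 checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_valid_entries(data: bytes) -> list:
    """Return payloads up to the first truncated or corrupted entry."""
    entries = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, checksum = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        # A short read or a checksum mismatch means a half-written entry.
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break
        entries.append(payload)
        offset = start + length
    return entries


log = encode_entry(b"set a=1") + encode_entry(b"set b=2")
torn = log + encode_entry(b"set c=3")[:-2]  # last entry cut short mid-write
print(read_valid_entries(torn))  # [b'set a=1', b'set b=2']
```

The half-written third entry is detected and discarded, while the two complete entries before it are recovered intact.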
The difference between a crash-consistent snapshot and an actual crash
During a real system crash, bad and unexpected things may happen, and the more complex the environment, the more likely such events are. I’ve seen files corrupted by garbage written to them. An operating system in an erroneous state can behave unpredictably, so any file or application may become corrupted. A crash-consistent snapshot, like an EBS snapshot, is typically taken during normal operations, so the chance of corruption that journaling or transaction logs cannot overcome is relatively small.
With transaction logs and journaling mechanisms, crash-consistent snapshots will work most of the time. The probability of a corrupted backup is low. That said, for critical production servers and applications, the backup must be consistent and recoverable, since corruption can cause serious damage to a business. In that case, the recommended best practice is to implement methods that ensure the backup is application-consistent.
Cloud Protection Manager (CPM) is an enterprise-class backup solution for EC2 based on EBS & RDS snapshots. It supports consistent backup of applications on Linux servers as well as Windows servers. CPM is sold on AWS Marketplace with prices ranging from $62.5/month to $500/month. See pricing or try it for free.