EBS Snapshots: crash-consistent vs. application-consistent

When you create an EBS snapshot, the snapshot mechanism guarantees that the snapshot is consistent. What does that mean? Every snapshot starts at a specific point in time, and the volume image it captures is the exact image of the volume at that point in time.

Crash-Consistent Snapshots

But is this image consistent in terms of the data and applications you are using the EBS volume for? It depends. While applications are running there may be open transactions, buffers in memory not yet flushed to disk, open files and other unfinished business. When an EBS snapshot is taken while an instance is up and running, whatever is in memory at that moment is not captured. The snapshot is “crash-consistent,” which means that restoring it is like turning a computer back on after someone pulled out its power cord.

Most modern applications (e.g. databases) know how to recover from such a state. Modern file systems like Ext3, Ext4 and NTFS have a journaling mechanism that allows them, in most cases, to recover open files without leaving them corrupted. Databases have transaction logs, which enable them to go back to the last consistent state even if they were in the middle of a transaction.

In most cases everything will work, but if you are dealing with important business data, sometimes “most” will not cut it.

Application-Consistent Snapshots

When we want to make sure a snapshot is application-consistent, meaning that applications recovered from it start from a consistent state and experience no issues, we need to “tell” the application it is about to be backed up so that it can prepare. Most applications have APIs that allow you to notify them when they are about to be backed up. They can then make sure that transactions are complete, buffers are flushed, files are closed and so on. The application enters what is sometimes called “backup mode” or simply a “freeze.” While the application is frozen, it does not answer requests, or at least write requests (depending on the application).

It is important to keep the application frozen for as short a time as possible so as not to hinder its operations (requests can begin to fail). The assumption is that the application needs to remain live during backup, and that a short freeze will not cause any failures. Since an EBS snapshot is consistent to the point in time at which it starts, we can release the freeze as soon as the snapshot has started, regardless of how long the snapshot takes to complete. The snapshot mechanism will ensure the snapshot’s consistency.
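The ordering described above (freeze, start the snapshot, release the freeze right away) can be sketched in a few lines of Python. This is only a minimal sketch: `freeze`, `unfreeze` and `start_snapshot` are hypothetical callables that you would supply for your own application and snapshot tooling; they are not part of any AWS API.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def frozen(freeze: Callable[[], None], unfreeze: Callable[[], None]) -> Iterator[None]:
    """Keep the application in backup mode only while the block runs."""
    freeze()
    try:
        yield
    finally:
        # Always release the freeze, even if starting the snapshot failed.
        unfreeze()


def consistent_snapshot(
    freeze: Callable[[], None],
    unfreeze: Callable[[], None],
    start_snapshot: Callable[[], str],
) -> str:
    """Start a snapshot while frozen and return its ID.

    start_snapshot is assumed to return as soon as the snapshot has been
    initiated, not when it completes, so the freeze stays short.
    """
    with frozen(freeze, unfreeze):
        snapshot_id = start_snapshot()
    return snapshot_id
```

The `try`/`finally` is the important design choice here: if starting the snapshot raises an error, the application is still unfrozen rather than left locked.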

What should you choose?

If you are relying on EBS snapshots as your EC2 backup solution, you need to decide on your approach based on your business needs. It is possible to trade application consistency for more frequent snapshots. If you wanted two snapshots a day on a specific volume, take four instead. Then, if you need to recover and the latest snapshot turns out not to be consistent, meaning the application cannot recover from it, you can fall back to the previous snapshot, which will most likely work, and you will not lose more data than you would have with your original snapshot frequency.
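To make the arithmetic behind this trade-off concrete, here is a small sketch. It assumes snapshots are evenly spaced over the day, and `recovers` is a hypothetical check you would implement yourself (for example, restore the snapshot to a test volume and start the application).

```python
from typing import Callable, Optional, Sequence


def worst_case_loss_hours(snapshots_per_day: int, fallbacks: int = 0) -> float:
    """Worst-case data loss in hours, assuming evenly spaced snapshots.

    fallbacks is the number of most recent snapshots that had to be skipped
    because the application could not recover from them.
    """
    interval_hours = 24 / snapshots_per_day
    return interval_hours * (fallbacks + 1)


def newest_recoverable(
    snapshot_ids: Sequence[str],
    recovers: Callable[[str], bool],
) -> Optional[str]:
    """Return the first snapshot the application recovers from.

    snapshot_ids must be ordered newest first. Returns None if none works.
    """
    for snapshot_id in snapshot_ids:
        if recovers(snapshot_id):
            return snapshot_id
    return None
```

With these assumptions, `worst_case_loss_hours(2)` and `worst_case_loss_hours(4, fallbacks=1)` both come out to 12 hours: doubling the frequency buys you one fallback at no extra worst-case loss.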

That said, if your data is critical to your business and losing it would cause severe damage, it is better to ensure application consistency. That also means closely monitoring the success of the backup process and running recovery drills to verify that recovery actually works.

Freezing different applications

We will try to publish new posts on this blog with solutions for specific applications. In the meantime, you can consult the documentation of any application to find the right commands to perform this freeze and unfreeze (the operations are sometimes named and used a bit differently).

In the end it comes down to one or two commands you can run from a script. If you are using scripts to perform your EBS snapshots, you can add this to your existing scripts. Cloud Protection Manager (CPM) supports application-consistent backup by running scripts.
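As an illustration of how short such a script can be, the sketch below freezes a Linux file system with the util-linux `fsfreeze` command, starts the snapshot with the AWS CLI (`aws ec2 create-snapshot`), and unfreezes immediately afterwards. Several things here are assumptions and not part of the original setup: the mount point and volume ID are placeholders, the AWS CLI is assumed to be installed and configured, the script needs root privileges for `fsfreeze`, and the frozen file system should be a data volume, not the one the script itself runs from. Note also that `fsfreeze` only quiesces the file system; a database would still need its own freeze command from its documentation.

```python
import json
import subprocess
from typing import Callable, List


def run_command(argv: List[str]) -> str:
    """Run a command, raise on a non-zero exit code, and return its stdout."""
    completed = subprocess.run(argv, check=True, capture_output=True, text=True)
    return completed.stdout


def snapshot_with_fsfreeze(
    mount_point: str,
    volume_id: str,
    run: Callable[[List[str]], str] = run_command,
) -> str:
    """Freeze mount_point, start an EBS snapshot of volume_id, then unfreeze."""
    run(["fsfreeze", "--freeze", mount_point])
    try:
        # create-snapshot returns once the snapshot is initiated, so the
        # file system stays frozen only for the duration of this call.
        output = run([
            "aws", "ec2", "create-snapshot",
            "--volume-id", volume_id,
            "--description", "application-consistent backup",
            "--output", "json",
        ])
    finally:
        # Release the freeze whether or not the snapshot call succeeded.
        run(["fsfreeze", "--unfreeze", mount_point])
    return json.loads(output)["SnapshotId"]
```

A call would look like `snapshot_with_fsfreeze("/data", "vol-0123456789abcdef0")`, where both arguments are placeholders for your own mount point and volume ID. The `run` parameter exists so that the command runner can be swapped out, for example for a dry run that only logs the commands.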

On Windows Server, there is an infrastructure for application-consistent backup called VSS (Volume Shadow Copy Service). Applications that support VSS, including Microsoft applications such as SQL Server, SharePoint, Exchange and Active Directory, know how to “freeze” whenever a VSS backup starts. CPM supports VSS as well.
