EBS snapshots are often the best and most efficient choice for operational backup. They are the equivalent of hardware snapshots in traditional data centers. In this post, we will discuss the challenges you encounter when you use EBS snapshots. EBS snapshots are provided by AWS as an infrastructure capability. You can take a snapshot of a volume by using the AWS Management Console or the API. There are a few scripts circulating on the web that can take snapshots, some even while freezing an XFS file system. Some users run them from “cron” jobs in Linux or the Windows Task Scheduler to automate the backup process. Beyond such scripts, there are not many other solutions available.
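As a rough illustration of the API route, here is a minimal Python sketch of the kind of function such a scheduled script might call. It assumes the client object behaves like the one returned by `boto3.client("ec2")`, whose `create_snapshot` call accepts `VolumeId` and `Description` and returns a dictionary containing `SnapshotId`. The function name and the description format are made up for this example.

```python
from datetime import datetime, timezone


def snapshot_volume(ec2_client, volume_id, now=None):
    """Request a snapshot of one EBS volume and return the new snapshot ID.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    now = now or datetime.now(timezone.utc)
    # A readable description makes the snapshot easier to identify later.
    description = f"scheduled backup of {volume_id} at {now:%Y-%m-%dT%H:%MZ}"
    response = ec2_client.create_snapshot(VolumeId=volume_id, Description=description)
    return response["SnapshotId"]
```

With boto3 installed and AWS credentials configured, a cron job could call something like `snapshot_volume(boto3.client("ec2"), "vol-...")` with a real volume ID. The call returns as soon as the snapshot has been started, not when it has completed.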
Although you can use any software backup solution, snapshots are faster and they don’t noticeably affect your instance’s performance. EBS snapshots are incremental, which positively affects not only their performance but also the cost of storing them. Because they are lightweight, you can take snapshots much more frequently than you would with a file-based backup solution. The snapshots are stored in S3, which gives them excellent durability. Furthermore, AWS recently released a new feature, “EBS Snapshot Copy,” which allows the copying of EBS snapshots between regions, giving them extra value as a disaster recovery solution as well as a solution for migrating volumes between regions. Let’s look at the challenges of EBS snapshot backup:
1 – Management
Management of snapshots becomes a challenge in larger cloud environments. When the number of snapshots you manage reaches the hundreds (or even fewer), it is important not to lose your way. If you have many snapshots and it’s difficult to tell which ones should be deleted and which ones to use for a critical recovery operation, you have a problem. If you are keeping many snapshots because you’re not sure whether they should be deleted, it’s not only confusing; it needlessly increases your AWS bill. What if you have different instances that require different backup policies? What if some instances need to be backed up daily and keep data available for 30 days, while other instances need to be backed up every 2 hours and keep data available for 2 weeks? When something changes in the environment (e.g. a new instance or a new volume on an instance), it is important to be able to adjust easily.
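To make the idea of per-instance policies concrete, here is a small Python sketch that models the two schedules mentioned above and finds the snapshots that have aged out of their retention window. The class, the policy names and the `(snapshot_id, start_time)` tuple layout are assumptions made for illustration, not part of any AWS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupPolicy:
    interval: timedelta   # how often a snapshot is taken
    retention: timedelta  # how long each snapshot is kept


# The two schedules from the text; the policy names are hypothetical.
POLICIES = {
    "daily-30d": BackupPolicy(interval=timedelta(days=1), retention=timedelta(days=30)),
    "2h-2w": BackupPolicy(interval=timedelta(hours=2), retention=timedelta(weeks=2)),
}


def expired_snapshots(snapshots, policy, now):
    """Return IDs of snapshots older than the policy's retention window.

    `snapshots` is a list of (snapshot_id, start_time) tuples.
    """
    cutoff = now - policy.retention
    return [snap_id for snap_id, started in snapshots if started < cutoff]
```

Dividing retention by interval gives the steady-state number of snapshots per volume: 30 for the daily policy and 168 for the two-hour policy. That difference is why manual cleanup stops being practical quite quickly.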
2 – Frequent Backup
If you are able to manage and control an environment with many snapshots, you can take snapshots much more frequently without the risk of getting lost under an unmanageable mountain of snapshots. When you take frequent snapshots, you reduce risk by ensuring that, when you need to recover, the data will be recent, and thus you minimize your RPO (Recovery Point Objective).
3 – Monitoring (or Visibility)
Another extension of the ability to manage a large and complex environment is being able to monitor your backups and make sure everything is going as planned. You need to be sure you know what’s really going on. If a backup on some instance stops running for some reason, when will you find out about it (hopefully before a recovery is necessary)? You need to be able to quickly and easily determine that all your snapshots are running correctly and according to plan. If your database doesn’t freeze and snapshots are not consistent, you want to know about it and to have an easy way to understand what went wrong. If something has changed in your environment, you need to find out and correct the backup configuration as soon as possible with minimal difficulty.
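One simple check that catches a silently stopped backup is to compare the time of each volume's latest snapshot with the interval its policy promises. The Python sketch below assumes you already have a mapping from volume ID to the time of the last completed snapshot. The function name and the 30-minute grace period are arbitrary choices for this example.

```python
from datetime import datetime, timedelta


def overdue_volumes(last_snapshot_at, expected_interval, now, grace=timedelta(minutes=30)):
    """Return volume IDs whose latest snapshot is older than interval + grace.

    `last_snapshot_at` maps volume ID -> datetime of the latest completed
    snapshot, or None if the volume has never been backed up.
    """
    deadline = now - (expected_interval + grace)
    return sorted(
        vol for vol, taken in last_snapshot_at.items()
        if taken is None or taken < deadline
    )
```

Running a check like this on a schedule and alerting on a non-empty result means a missed backup is noticed within one interval rather than at recovery time.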
4 – Application Support
EBS snapshots are by nature “crash-consistent.” They present the image of the volume exactly as it was at the point in time of the snapshot, but there are no guarantees about the state of the system and applications at that time. It is exactly like the image of a physical disk at the moment that someone pulled out the power cord. When later booting the system, you can’t guarantee that everything will work correctly. In many cases, you will want to move your application into “backup mode”, or what is also called “application quiescence.” This means you want to “tell” your application that it’s about to be backed up just before you take a snapshot. Your application will then momentarily “freeze” its activity (making sure data on the volume is consistent by flushing caches, closing files, etc.). At this point, the snapshots are started. Immediately after they start, you need to “thaw” or “unfreeze” your application and allow it to continue working. This freezing time is usually not more than 1-2 seconds, so applications can continue working and serving requests without failing or closing any sessions or connections. Sometimes you will want to perform additional operations on your applications after the backup has finished consistently and successfully. This is a good time to truncate a database’s transaction logs, which prevents them from growing unnecessarily and consuming excessive storage space. You want your backup solution to be flexible enough to support such operations easily and to support different kinds of environments and applications. A short list of applications worth supporting includes MySQL, PostgreSQL, MongoDB, XFS (a file system can be viewed as an application) and probably VSS (Volume Shadow Copy Service) for Windows servers, to support SQL Server, SharePoint, etc.
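The freeze, snapshot, thaw sequence described above can be sketched in Python as follows. This is a minimal sketch for a Linux file system only, assuming the `fsfreeze` utility from util-linux is installed and the script runs with sufficient privileges. `start_snapshot` is a hypothetical caller-supplied function that only initiates the snapshot, and application-level quiescing (for example, for a database) would need its own steps.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen_filesystem(mount_point, run=subprocess.check_call):
    """Freeze a mounted file system and always thaw it on exit."""
    run(["fsfreeze", "--freeze", mount_point])
    try:
        yield
    finally:
        # Thaw even if starting the snapshot raised, so writes never stay blocked.
        run(["fsfreeze", "--unfreeze", mount_point])


def consistent_snapshot(mount_point, start_snapshot, run=subprocess.check_call):
    """Freeze, start the snapshot, thaw; return whatever start_snapshot returns.

    `start_snapshot` should only initiate the snapshot, so the file system
    stays frozen for as short a time as possible.
    """
    with frozen_filesystem(mount_point, run=run):
        return start_snapshot()
```

The `run` parameter exists so that the command runner can be replaced in tests. The `finally` clause matters most here, because a failed snapshot request must not leave the file system frozen.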
5 – Rapid and Easy Recovery
Backup solutions usually look good enough as long as you don’t need to perform a recovery. They’re like insurance policies: they’re only put to the test when something bad happens. When you need to recover data, you will not want to start looking through dozens or hundreds of snapshots in your AWS console to figure out which one you need to use. You don’t want to start working out which snapshots belong to which volumes and instances and which would be the best ones to choose. You also don’t want to start digging to find the configuration (e.g. instance type, security groups, key pairs, tags) you need to launch your instances, which may have crashed and no longer exist. Maybe you need the most recent consistent backup because the latest freeze failed? Maybe you need the most recent consistent backup from last Wednesday? There are many possible scenarios. When the situation is stressful, you don’t want to make mistakes, which are highly probable with manual processes. In the “moment of truth,” when you need to recover critical business data, you want to perform recovery as quickly as possible and without making mistakes.
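The two scenarios above (the latest consistent backup, or the latest consistent backup as of a given day) come down to one query over a snapshot catalog. The Python sketch below assumes such a catalog exists and that each record notes whether the freeze step succeeded. The record fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SnapshotRecord:
    snapshot_id: str
    volume_id: str
    started: datetime
    consistent: bool  # True if the freeze/quiesce step succeeded


def pick_recovery_point(records, volume_id, not_after=None, require_consistent=True):
    """Return the newest matching snapshot for a volume, or None if none match."""
    candidates = [
        r for r in records
        if r.volume_id == volume_id
        and (not_after is None or r.started <= not_after)
        and (r.consistent or not require_consistent)
    ]
    return max(candidates, key=lambda r: r.started, default=None)
```

Passing `not_after` set to the end of last Wednesday answers the second question. Leaving it as `None` skips over a newer snapshot whose freeze failed and returns the most recent consistent one.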
To answer all the challenges of managing backup in an EC2 environment, you need the right solution. Such a solution should allow you to easily manage and monitor the backup of all your instances and volumes, and it should support consistent backups through file system freezing and application quiescence. It should be easy to deploy and manage, provide automatic retention management (deletion of old snapshots), and help you easily and quickly recover complete instances and separate volumes.