The past few years have seen unprecedented growth in the big data field. Unlike traditional database systems, big data sets are distributed across a network and processed in parallel for greater efficiency. Backing up huge file systems and databases across these distributed systems presents a significant challenge. While the backup process has become simpler and more affordable with the development of incremental transfers, cloud-based servers, and commodity hardware, DevOps teams still face many challenges.
Location of Storage
There are multiple options available for backing up big data, including tapes and local or cloud storage.
- Tapes are widely used for data backup and recovery in case of a failure. However, these tapes are prone to media errors, and wear and tear due to heavy use may sometimes render data unrecoverable.
- Cloud storage is emerging as the most trusted data storage option, and is beginning to overshadow its competitors. However, bandwidth, network capability, and the amount of data to be transferred from virtual machines to the cloud infrastructure are seen as issues of concern in creating backups.
Metadata fidelity remains one of the most important concerns while performing a backup. Even a minute change in metadata can result in huge data loss or inconsistency in the data sets. Categorizing metadata helps in managing exponentially growing volumes of data, and at least two copies of the metadata should be stored on two different types of media, for example a tape and a disk.
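One way to keep those independently stored metadata copies honest is to fingerprint each copy and compare the digests on a schedule. The sketch below is a minimal illustration of that idea, assuming the metadata copies are accessible as ordinary files; the function names are hypothetical, not part of any backup product.

```python
import hashlib


def fingerprint(path: str, algo: str = "sha256") -> str:
    """Hash a metadata file in fixed-size blocks so large files fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def copies_consistent(primary: str, secondary: str) -> bool:
    """Compare the digests of the two independently stored metadata copies."""
    return fingerprint(primary) == fingerprint(secondary)
```

A mismatch tells you one copy has drifted; it does not tell you which one is correct, which is exactly why keeping at least two copies on different media matters.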
Security and Resilience
Safeguarding data is a major concern, as security may be compromised in the process of data backup. End-to-end encryption is critical, especially when it comes to cloud storage, with its reliance on third-party enterprises. Data should be protected while in transit, and should remain encrypted while in storage at its destination.
Data set size can be minimized using compression and source-side deduplication. Data deduplication, often referred to as intelligent compression, is a process that eliminates redundant data and reduces storage overhead, bandwidth usage, and cloud backup storage costs.
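The combination of source-side deduplication and compression can be sketched in a few lines: each chunk is keyed by its hash, only previously unseen chunks are compressed and stored, and an ordered list of hashes serves as the recipe for reconstruction. This is a simplified illustration, not a production dedup engine; real systems chunk variable-length data and persist the store.

```python
import hashlib
import zlib


def dedupe_and_compress(chunks):
    """Source-side deduplication: store each unique chunk once, compressed.

    Returns the chunk store (digest -> compressed bytes) and the ordered
    list of digests needed to reconstruct the original stream.
    """
    store = {}
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # only previously unseen data is kept
            store[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Reassemble the original stream from the deduplicated store."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)
```

Because duplicate chunks contribute only a digest to the recipe, repeated data costs almost nothing to back up; this is where the bandwidth and storage savings come from.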
Hash collisions present a potential problem in data deduplication. During the process, every small chunk of data is assigned a hash value, which is then compared with the existing set of hash values. When a hash value is already present, the associated data chunk is considered a duplicate and is not stored. In some rare cases, however, the same hash value may be assigned to two different chunks of data; such a false positive causes a unique chunk to be silently discarded, resulting in data loss.
Under-provisioning of resources causes performance issues and can result in server downtime, so as data sets grow, the scalability of resources must also be considered. While performing a backup, it is therefore important to decouple compute and storage capacity. One way to separate them is to invest in storage infrastructure that can scale storage and compute independently.
Hybrid Cloud Storage
Many enterprises running short of archive storage infrastructure or wishing to move older data to the cloud opt for hybrid cloud storage. Other users, lacking computational resources, may need to move entire data sets for computing. Data synchronization presents a major challenge here, as data, dark data, and metadata all need to be in sync.
Automated Snapshots as an Effective Backup Mechanism
A “snapshot” refers to the state of data frozen at a particular moment, and is an efficient backup technique with minimal incremental data. Recovery from automated snapshots, however, typically raises questions about data consistency and requires extensive manual testing and verification. Cloud Protection Manager (CPM) is a backup and disaster recovery solution that automates the tedious, manual process of recovering snapshots, reducing data recovery time to virtually instantaneous. CPM allows you to back up as frequently as you need and retain snapshots for as long as you need. Note that snapshots may still incur significant storage overhead, as big data systems are incremental in nature.
Vital Points for Backup and Recovery
- Commodity hardware is reliable and scales seamlessly while remaining cost-efficient.
- Scripts are error-prone; when using Python, Perl, or shell scripts, the consistency of metadata, data sets, and dark data should be checked regularly. Replicas of the cluster cannot serve as previous backup points, as any change in the file system is immediately replicated to those copies as well.
- Generating derived data from raw data might appear simple, but it must be understood that the cluster performing the computation is subject to variations in efficiency.
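The consistency checks recommended above can be as simple as comparing data sets against a manifest of known-good checksums. The sketch below assumes a hypothetical manifest format (a JSON map of relative path to SHA-256 hex digest); it is an illustration of the idea, not a drop-in tool.

```python
import hashlib
import json
import os


def verify_manifest(manifest_path, root):
    """Check files under `root` against a manifest of expected SHA-256 digests.

    The manifest is a JSON object mapping relative path -> hex digest (an
    assumed format for illustration). Returns the list of paths that are
    missing or whose contents no longer match their recorded digest.
    """
    with open(manifest_path) as f:
        expected = json.load(f)
    bad = []
    for rel_path, digest in expected.items():
        full = os.path.join(root, rel_path)
        if not os.path.exists(full):
            bad.append(rel_path)
            continue
        h = hashlib.sha256()
        with open(full, "rb") as data:
            for block in iter(lambda: data.read(65536), b""):
                h.update(block)
        if h.hexdigest() != digest:
            bad.append(rel_path)
    return bad
```

Running a check like this on a schedule catches silent corruption or script-induced drift before a backup of the damaged data overwrites the last good copy.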
Enterprises must realize the importance of implementing a thorough, strategic backup plan, as snapshots or local disk backups alone may not suffice. Backing up data to a cloud infrastructure offers advantages over tapes or virtual machines, and ideal backup procedures adopt a deduplicating, incremental-forever approach that reduces costs. Whichever plan you choose to execute, it is vital to prepare a full-fledged strategy in advance, covering every aspect of the system, rather than jumping straight into the backup.