Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata

Restoring data is the main purpose of data backup in storage systems. The fragmentation problem, caused by physically scattering logically contiguous data across many disk locations, degrades the restore performance of a deduplication system. Rewriting algorithms alleviate fragmentation and thereby improve the restore speed of a deduplication system. However, rewriting methods sacrifice a significant portion of the deduplication ratio, wasting a large amount of storage space. Furthermore, traditional backup approaches treat file metadata and chunk metadata identically, which causes frequent on-disk metadata accesses. In this article, we begin by analyzing the storage characteristics of backup metadata. An intriguing finding is that the file metadata of 10 million files occupies only about 340 MB. Motivated by this finding, we propose a Classified-Metadata based Restoring method (CMR) that classifies backup metadata into file metadata and chunk metadata. Because file metadata occupies only a small amount of space, CMR keeps all file metadata in memory, whereas chunk metadata is aggressively prefetched into memory in a greedy manner. A deduplication system with CMR in place exhibits three salient features: (i) it avoids the additional overhead of rewriting algorithms by reducing the number of disk reads during restore, (ii) it increases restore throughput without sacrificing the deduplication ratio, and (iii) it fully leverages hardware resources to boost restore performance. To quantitatively evaluate CMR, we compare it against two state-of-the-art approaches: a history-aware rewriting method (HAR) and a context-based rewriting scheme (CAP). The experimental results show that, compared to HAR and CAP, CMR reduces the restore time by 27.2% and 29.3%, respectively, while improving the deduplication ratio by 1.91% and 4.36%, respectively.
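
To make the idea concrete, below is a minimal Python sketch of a CMR-style restore loop under the assumptions stated in the abstract: file recipes (file metadata) are held entirely in memory, while chunk metadata lives in on-disk containers and is greedily prefetched a whole container at a time on each cache miss. All names here (ContainerStore, restore_file, the fingerprint/container layout) are hypothetical illustrations, not the authors' actual implementation.

```python
from collections import OrderedDict

class ContainerStore:
    """Toy on-disk container store: maps a container ID to a
    {fingerprint: chunk_data} table and counts disk reads."""
    def __init__(self, containers):
        self.containers = containers
        self.disk_reads = 0

    def read_container(self, container_id):
        self.disk_reads += 1  # one disk I/O per container fetched
        return self.containers[container_id]

def restore_file(recipe, store, cache, cache_capacity=64):
    """Restore one file from its recipe, a list of
    (fingerprint, container_id) pairs kept entirely in memory.

    On a cache miss, the whole container's chunk table is
    prefetched at once (the greedy prefetch), exploiting
    backup-stream locality; an LRU policy bounds the memory
    spent on chunk metadata.
    """
    data = []
    for fingerprint, container_id in recipe:
        if container_id not in cache:
            cache[container_id] = store.read_container(container_id)
            if len(cache) > cache_capacity:
                cache.popitem(last=False)  # evict least recently used
        else:
            cache.move_to_end(container_id)  # refresh LRU position
        data.append(cache[container_id][fingerprint])
    return b"".join(data)

# Usage: file metadata is a small in-memory map {path: recipe}.
store = ContainerStore({0: {"fp1": b"A", "fp2": b"B"}, 1: {"fp3": b"C"}})
file_metadata = {"/backup/foo": [("fp1", 0), ("fp2", 0), ("fp3", 1)]}

cache = OrderedDict()
print(restore_file(file_metadata["/backup/foo"], store, cache))  # b'ABC'
print(store.disk_reads)  # 2 container reads for 3 chunks
```

In this toy run, restoring three chunks costs only two container reads, and because the recipes are memory-resident, no disk access is ever needed for file metadata; reducing restore-path disk reads in this way is the effect CMR exploits.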
