A Fast Asymmetric Extremum Content Defined Chunking Algorithm for Data Deduplication in Backup Storage Systems

Chunk-level deduplication plays an important role in backup storage systems. Existing Content-Defined Chunking (CDC) algorithms, while robust in finding suitable chunk boundaries, face three key challenges: (1) low chunking throughput, which makes the chunking stage a serious deduplication performance bottleneck; (2) large chunk size variance, which decreases deduplication efficiency; and (3) an inability to find proper chunk boundaries in low-entropy strings, so that these strings fail to deduplicate. To address these challenges, this paper proposes a new CDC algorithm called the Asymmetric Extremum (AE) algorithm. AE is based on the observation that, when dealing with the boundary-shifting problem, the extreme value in an asymmetric local range is unlikely to be replaced by a new extreme value. As a result, AE achieves higher chunking throughput and smaller chunk size variance than existing CDC algorithms, and it is able to find proper chunk boundaries in low-entropy strings. Experimental results based on real-world datasets show that AE improves the throughput of state-of-the-art CDC algorithms by more than $2.3\times$, which is fast enough to remove the chunking-throughput bottleneck of deduplication, and accelerates system throughput by more than 50 percent, while achieving comparable deduplication efficiency.
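The abstract states only the idea behind AE, so the following is a minimal illustrative sketch rather than the paper's implementation. It assumes that values are compared one byte at a time, that the extremum is a maximum, and that a boundary is declared once a fixed number of positions (`window`, a hypothetical parameter name) have passed after the current maximum without a strictly larger value appearing. The function names `ae_cut` and `ae_chunks` are likewise hypothetical.

```python
from typing import List


def ae_cut(data: bytes, start: int, window: int) -> int:
    """Return the exclusive end index of the chunk that begins at `start`.

    Assumption: the range to the left of the maximum is variable-length
    (everything since `start`), while the range to its right has a fixed
    length `window`. That difference is the asymmetry.
    """
    max_val = data[start]
    max_pos = start
    i = start + 1
    while i < len(data):
        if data[i] <= max_val:
            # No new extremum; cut once `window` positions follow the maximum.
            if i == max_pos + window:
                return i + 1
        else:
            # Strictly larger value: it becomes the new local maximum.
            max_val = data[i]
            max_pos = i
        i += 1
    return len(data)  # Tail of the input forms the last chunk.


def ae_chunks(data: bytes, window: int) -> List[bytes]:
    """Split `data` into consecutive chunks using `ae_cut`."""
    chunks = []
    start = 0
    while start < len(data):
        end = ae_cut(data, start, window)
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because ties do not replace the current maximum in this sketch, a run of identical bytes still produces boundaries every `window + 1` bytes, which illustrates how an extremum-based rule can cut low-entropy strings. Each byte is examined once with a single comparison and no rolling hash, which is consistent with the throughput argument in the abstract.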
