Leverage similarity and locality to enhance fingerprint prefetching of data deduplication

Data deduplication is widely used in data backup systems because it significantly reduces the requirements for storage capacity and network bandwidth. However, deduplication performance gradually decreases as the volume of deduplicated data grows. The reason is that the number of fingerprints grows significantly with the amount of backup data, so a large portion of the fingerprints must be stored on disk drives. Locating fingerprints then incurs frequent disk accesses, which stall the deduplication process. Furthermore, the fingerprints belonging to the same file may be scattered across the disk. This generates small, random disk accesses and causes significant performance degradation when the fingerprints are referenced. Additionally, a single fingerprint may appear only once during a backup process, so the cache hit ratio is very low due to the lack of temporal locality. This paper proposes to employ file similarity to enhance fingerprint prefetching, thus improving the cache hit ratio and the performance of data deduplication. Furthermore, the fingerprints are arranged sequentially in the order of the backup data stream to maintain locality and improve performance. Experimental results demonstrate that the proposed idea can effectively reduce the number of fingerprint accesses that go to disk drives and decrease the query overhead of fingerprints, thus significantly alleviating the disk bottleneck of data deduplication.
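The idea summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the fixed-size chunking, the SHA-1 fingerprints, the use of the smallest fingerprint as a similarity feature, and all names (`PrefetchingIndex`, `backup`, `CHUNK_SIZE`) are assumptions made for illustration. Each file's fingerprints are kept together in stream order as one simulated on-disk segment, and a similar earlier file's segment is prefetched into an LRU cache with a single read. Chunks that miss the cache are simply treated as new, so the sketch performs near-exact rather than exact deduplication.

```python
import hashlib
from collections import OrderedDict

CHUNK_SIZE = 4  # bytes; tiny on purpose so the demo stays readable (assumption)


def fingerprints(data):
    """Fixed-size chunking with one SHA-1 digest per chunk (illustrative choices)."""
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]


class PrefetchingIndex:
    """Hypothetical fingerprint index with similarity-driven prefetching."""

    def __init__(self, cache_capacity=1024):
        self.segments = []          # simulated disk: one list per file, in stream order
        self.by_feature = {}        # similarity feature -> segment id (kept in memory)
        self.cache = OrderedDict()  # LRU cache of fingerprints
        self.cache_capacity = cache_capacity
        self.disk_reads = 0         # number of simulated segment reads

    def _touch(self, fp):
        # Insert or refresh a fingerprint, evicting the least recently used entry.
        self.cache[fp] = True
        self.cache.move_to_end(fp)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)

    def _prefetch(self, seg_id):
        # One sequential read brings in every fingerprint of the similar file.
        self.disk_reads += 1
        for fp in self.segments[seg_id]:
            self._touch(fp)

    def backup(self, data):
        """Deduplicate one file and return how many chunks were treated as new."""
        fps = fingerprints(data)
        if not fps:
            return 0
        feature = min(fps)  # stand-in similarity feature, not the paper's method
        seg_id = self.by_feature.get(feature)
        if seg_id is not None:
            self._prefetch(seg_id)
        new_chunks = 0
        for fp in fps:
            if fp not in self.cache:
                new_chunks += 1  # cache miss: assumed to be a new chunk
            self._touch(fp)
        # Store this file's fingerprints contiguously, in backup-stream order.
        self.segments.append(fps)
        self.by_feature[feature] = len(self.segments) - 1
        return new_chunks
```

In this sketch, backing up a file that was already stored costs one segment read instead of one index lookup per chunk, even when the cache starts out cold. A file with no matching feature triggers no disk read at all, and every one of its chunks is stored as new.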
