Lazy exact deduplication

During data deduplication, on-disk fingerprint lookups generate heavy disk traffic, which becomes a bottleneck. In this paper, we propose a “lazy” data deduplication method that buffers incoming fingerprints and performs on-disk lookups in batches, with the aim of reducing this disk bottleneck. Deduplication systems commonly use prefetching to improve the cache hit rate by exploiting locality within the incoming fingerprint stream. For lazy deduplication, we design a buffering strategy that preserves this locality so that prefetching remains effective. Experimental results indicate that the lazy method improves fingerprint identification performance by over 50% compared with an “eager” method that uses the same data layout.
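The batching idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class name `LazyDeduplicator`, the `batch_size` parameter, the use of SHA-1 fingerprints, and the in-memory `set` standing in for the on-disk index are all assumptions made for illustration, and the locality-preserving buffering strategy and prefetching are not modelled. The sketch only shows how buffering fingerprints lets one pass over the index serve a whole batch while still identifying every duplicate.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Fingerprint a chunk (SHA-1 is an assumption; the abstract names no hash)."""
    return hashlib.sha1(chunk).hexdigest()


class LazyDeduplicator:
    """Toy model: buffer fingerprints and resolve them against the index in batches."""

    def __init__(self, index: set, batch_size: int = 4):
        self.index = index            # in-memory set standing in for the on-disk index
        self.batch_size = batch_size  # hypothetical tuning parameter
        self.buffer = []              # pending fingerprints, kept in arrival order
        self.batch_lookups = 0        # number of simulated passes over the index
        self.results = []             # (fingerprint, is_duplicate) in arrival order

    def add(self, chunk: bytes) -> None:
        """Buffer the chunk's fingerprint; look up only when the buffer is full."""
        self.buffer.append(fingerprint(chunk))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Resolve every buffered fingerprint with a single simulated index pass."""
        if not self.buffer:
            return
        self.batch_lookups += 1
        found = self.index.intersection(self.buffer)
        for fp in self.buffer:
            is_duplicate = fp in found
            self.results.append((fp, is_duplicate))
            if not is_duplicate:
                self.index.add(fp)
                found.add(fp)  # later copies within the same batch count as duplicates
        self.buffer.clear()


if __name__ == "__main__":
    dedup = LazyDeduplicator(index=set(), batch_size=4)
    for chunk in [b"a", b"b", b"a", b"c", b"b", b"d"]:
        dedup.add(chunk)
    dedup.flush()  # resolve whatever is still buffered at the end of the stream
    print([dup for _, dup in dedup.results], dedup.batch_lookups)
```

In this toy run, six chunks are resolved with two simulated index passes, where an eager scheme would issue one lookup per chunk. Because each batch is checked both against the index and against earlier fingerprints in the same batch, no duplicate is missed, which is what keeps the scheme exact.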
