A Survey and Classification of Storage Deduplication Systems

The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. Thus, it has been applied to different storage types, including archives and backups, primary storage, within solid-state drives, and even to random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, thus underestimating the relevance of new research and development. The first contribution of this article is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed.

[1]  Ethan L. Miller,et al.  The effectiveness of deduplication on virtual machine disk images , 2009, SYSTOR '09.

[2]  Hong Jiang,et al.  DEBAR: A scalable high-performance de-duplication storage system for backup and archiving , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[3]  Christopher Chute,et al.  The Diverse and Exploding Digital Universe , 2011 .

[4]  Suresh Jagannathan,et al.  Improving duplicate elimination in storage systems , 2006, TOS.

[5]  Peter Desnoyers,et al.  Memory buddies: exploiting page sharing for smart colocation in virtualized data centers , 2009, VEE '09.

[6]  João Paulo,et al.  DEDISbench: A Benchmark for Deduplicated Storage Systems , 2012, OTM Conferences.

[7]  Scott Devine,et al.  Disco: running commodity operating systems on scalable multiprocessors , 1997, TOCS.

[8]  David D. Chambliss,et al.  Mixing Deduplication and Compression on Active Data Sets , 2011, 2011 Data Compression Conference.

[9]  André Brinkmann,et al.  Multi-level comparison of data deduplication in a backup scenario , 2009, SYSTOR '09.

[10]  Tian Luo,et al.  CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives , 2011, FAST.

[11]  Timothy Bisson,et al.  iDedup: latency-aware, inline data deduplication for primary storage , 2012, FAST.

[12]  Leonard D. Shapiro,et al.  Join processing in database systems with large main memories , 1986, TODS.

[13]  Isaac Woungang,et al.  A secure data deduplication framework for cloud environments , 2012, 2012 Tenth Annual International Conference on Privacy, Security and Trust.

[14]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[15]  Anand Sivasubramaniam,et al.  Leveraging Value Locality in Optimizing NAND Flash-based SSDs , 2011, FAST.

[16]  Hong Jiang,et al.  A Scalable Inline Cluster Deduplication Framework for Big Data Protection , 2012, Middleware.

[17]  Bin Fan,et al.  SILT: a memory-efficient, high-performance key-value store , 2011, SOSP.

[18]  William J. Bolosky,et al.  Single Instance Storage in Windows , 2000 .

[19]  Fred Douglis,et al.  USENIX Association Proceedings of the General Track : 2003 USENIX Annual , 2003 .

[20]  Sean Quinlan,et al.  Venti: A New Approach to Archival Storage , 2002, FAST.

[21]  Marvin Theimer,et al.  Reclaiming space from duplicate files in a serverless distributed file system , 2002, Proceedings 22nd International Conference on Distributed Computing Systems.

[22]  Darrell D. E. Long,et al.  Secure data deduplication , 2008, StorageSS '08.

[23]  Suman Nath,et al.  Cheap and Large CAMs for High Performance Data-Intensive Networked Systems , 2010, NSDI.

[24]  Kai Li,et al.  Tradeoffs in Scalable Data Routing for Deduplication Clusters , 2011, FAST.

[25]  Mark Lillibridge,et al.  Extreme Binning: Scalable, parallel deduplication for chunk-based file backup , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[26]  David R. Cheriton,et al.  HICAMP: architectural support for efficient concurrency-safe shared structured data access , 2012, ASPLOS XVII.

[27]  David Hung-Chang Du,et al.  Frequency Based Chunking for Data De-Duplication , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[28]  Jin Li,et al.  ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory , 2010, USENIX Annual Technical Conference.

[29]  André Brinkmann,et al.  dedupv1: Improving deduplication throughput using solid state drives (SSD) , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[30]  Darrell D. E. Long,et al.  Deep Store: an archival storage system architecture , 2005, 21st International Conference on Data Engineering (ICDE'05).

[31]  John C. S. Lui,et al.  Live Deduplication Storage of Virtual Machine Images in an Open-Source Cloud , 2011, Middleware.

[32]  Hong Jiang,et al.  SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput , 2011, USENIX Annual Technical Conference.

[33]  Darrell D. E. Long,et al.  Providing High Reliability in a Minimum Redundancy Archival Storage System , 2006, 14th IEEE International Symposium on Modeling, Analysis, and Simulation.

[34]  Anne-Marie Kermarrec,et al.  Probabilistic deduplication for cluster-based storage systems , 2012, SoCC '12.

[35]  Jongmoo Choi,et al.  Deduplication in SSDs: Model and quantitative analysis , 2012, 012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[36]  Cezary Dubnicki,et al.  HydraFS: A High-Throughput File System for the HYDRAstor Content-Addressable Storage System , 2010, FAST.

[37]  Petros Efstathopoulos,et al.  Building a High-performance Deduplication System , 2011, USENIX Annual Technical Conference.

[38]  Michal Kaczmarczyk,et al.  Reducing impact of data fragmentation caused by in-line deduplication , 2012, SYSTOR '12.

[39]  Raju Rangaswami,et al.  I/O Deduplication: Utilizing content similarity to improve I/O performance , 2010, TOS.

[40]  Cyrille Artho,et al.  Moving from Logical Sharing of Guest OS to Physical Sharing of Deduplication on Virtual Machine , 2010, HotSec.

[41]  Dutch T. Meyer,et al.  Parallax: virtual disks for virtual machines , 2008, Eurosys '08.

[42]  Sudipta Sengupta,et al.  Primary Data Deduplication - Large Scale Study and System Design , 2012, USENIX Annual Technical Conference.

[43]  Christian S. Collberg,et al.  SLINKY: Static Linking Reloaded , 2005, USENIX Annual Technical Conference, General Track.

[44]  Aleksey Pesterev,et al.  Fast, Inexpensive Content-Addressed Storage in Foundation , 2008, USENIX Annual Technical Conference.

[45]  Mark Lillibridge,et al.  Jumbo Store: Providing Efficient Incremental Upload and Versioning for a Utility Rendering Service , 2007, FAST.

[46]  Michal Kaczmarczyk,et al.  HYDRAstor: A Scalable Secondary Storage , 2009, FAST.

[47]  Udi Manber,et al.  Finding Similar Files in a Large File System , 1994, USENIX Winter.

[48]  Tzi-cker Chiueh,et al.  Hypervisor Support for Efficient Memory De-duplication , 2011, 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[49]  Fred Douglis,et al.  Redundancy Elimination Within Large Collections of Files , 2004, USENIX Annual Technical Conference, General Track.

[50]  André Brinkmann,et al.  Design of an exact data deduplication cluster , 2012, 012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[51]  Anthony Liguori,et al.  Experiences with Content Addressable Storage and Virtual Disks , 2008, Workshop on I/O Virtualization.

[52]  Takashi Watanabe,et al.  DBLK: Deduplication for primary block storage , 2011, 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST).

[53]  A. Broder Some applications of Rabin’s fingerprinting method , 1993 .

[54]  Brian Berliner,et al.  CVS II: Parallelizing Software Dev elopment , 1998 .

[55]  Mahadev Satyanarayanan,et al.  Design Tradeoffs in Applying Content Addressable Storage to Enterprise-scale Systems Based on Virtual Machines , 2006, USENIX Annual Technical Conference, General Track.

[56]  Shmuel Tomi Klein,et al.  The design of a similarity based deduplication system , 2009, SYSTOR '09.

[57]  Cezary Dubnicki,et al.  Bimodal Content Defined Chunking for Backup Streams , 2010, FAST.

[58]  Cyrille Artho,et al.  Memory deduplication as a threat to the guest OS , 2011, EUROSEC '11.

[59]  Jin Li,et al.  SkimpyStash: RAM space skimpy key-value store on flash-based storage , 2011, SIGMOD '11.

[60]  Walter F. Tichy,et al.  Delta algorithms: an empirical analysis , 1998, TSEM.

[61]  Darrell D. E. Long,et al.  Duplicate Data Elimination in a SAN File System , 2004, MSST.

[62]  Kave Eshghi,et al.  A Framework for Analyzing and Improving Content-Based Chunking Algorithms , 2005 .

[63]  Kai Li,et al.  Avoiding the Disk Bottleneck in the Data Domain Deduplication File System , 2008, FAST.

[64]  Irfan Ahmad,et al.  Decentralized Deduplication in SAN Cluster File Systems , 2009, USENIX Annual Technical Conference.

[65]  George Varghese,et al.  Difference engine , 2010, OSDI.

[66]  David Hung-Chang Du,et al.  BloomStore: Bloom-Filter based memory-efficient key-value store for indexing of data deduplication on flash , 2012, 012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[67]  Benny Pinkas,et al.  Side Channels in Cloud Services: Deduplication in Cloud Storage , 2010, IEEE Security & Privacy.

[68]  Ian Pratt,et al.  Proceedings of the General Track: 2004 USENIX Annual Technical Conference , 2004 .

[69]  Carl A. Waldspurger,et al.  Memory resource management in VMware ESX server , 2002, OSDI '02.

[70]  William H. Sanders,et al.  Modeling the Fault Tolerance Consequences of Deduplication , 2011, 2011 IEEE 30th International Symposium on Reliable Distributed Systems.

[71]  Philip Shilane,et al.  Delta Compressed and Deduplicated Storage Using Stream-Informed Locality , 2012, HotStorage.

[72]  Dan Feng,et al.  Scalable high performance de-duplication backup via hash join , 2009, Journal of Zhejiang University SCIENCE C.

[73]  Mark Lillibridge,et al.  Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality , 2009, FAST.

[74]  共立出版株式会社 コンピュータ・サイエンス : ACM computing surveys , 1978 .

[75]  Nasir D. Memon,et al.  Cluster-based delta compression of a collection of files , 2002, Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002..

[76]  Erez Zadok,et al.  Generating Realistic Datasets for Deduplication Analysis , 2012, USENIX Annual Technical Conference.

[77]  Brian D. Noble,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation Pastiche: Making Backup Cheap and Easy , 2022 .

[78]  Steven Hand,et al.  Satori: Enlightened Page Sharing , 2009, USENIX Annual Technical Conference.

[79]  Hong Jiang,et al.  MAD2: A scalable high-throughput exact deduplication approach for network backup services , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[80]  Pin Zhou,et al.  Demystifying data deduplication , 2008, Companion '08.

[81]  Randal C. Burns,et al.  Efficient distributed backup with delta compression , 1997, IOPADS '97.

[82]  Christos T. Karamanolis,et al.  Evaluation of Efficient Archival Storage Techniques , 2004, MSST.

[83]  Prateek Sharma,et al.  Singleton: system-wide page deduplication in virtual environments , 2012, HPDC '12.

[84]  Dutch T. Meyer,et al.  A study of practical deduplication , 2011, TOS.

[85]  William J. Bolosky,et al.  Single instance storage in Windows® 2000 , 2000 .