A Comprehensive Study of the Past, Present, and Future of Data Deduplication

Data deduplication, an efficient approach to data reduction, has gained increasing attention and popularity in large-scale storage systems due to the explosive growth of digital data. It eliminates redundant data at the file or subfile level and identifies duplicate content by its cryptographically secure hash signature (i.e., a collision-resistant fingerprint), an approach that has been shown to be much more computationally efficient than traditional compression in large-scale storage systems. In this paper, we first review the background and key features of data deduplication, then summarize and classify the state-of-the-art research according to the key workflow of the deduplication process. This summary and taxonomy help identify and understand the most important design considerations for data deduplication systems. In addition, we discuss the main applications and industry trends of data deduplication, and provide a list of publicly available sources for deduplication research and studies. Finally, we outline the open problems and future research directions facing deduplication-based storage systems.
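As a rough illustration of fingerprint-based duplicate identification at the subfile level, the following minimal sketch splits a byte string into fixed-size chunks and stores each unique chunk once, keyed by its hash. It is not taken from the paper: the function names `deduplicate` and `restore`, the 4 KiB chunk size, the in-memory dictionary index, and the choice of SHA-256 as the fingerprint are all assumptions made for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each unique chunk once.

    Hypothetical helper: fixed-size chunking and SHA-256 are assumed here
    purely for illustration.
    """
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        # A chunk whose fingerprint is already indexed is treated as a duplicate.
        if fingerprint not in store:
            store[fingerprint] = chunk
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by concatenating chunks in recipe order."""
    return b"".join(store[fp] for fp in recipe)


# Four 4 KiB chunks, three of which have identical content.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, recipe = deduplicate(data)
```

In this example the recipe references four chunks while the store holds only two, and `restore(store, recipe)` returns the original bytes. A real system would typically also need a persistent fingerprint index and a chunking policy suited to its workload, which this sketch leaves out.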
