DEC: An Efficient Deduplication-Enhanced Compression Approach

Data compression is widely used in storage systems to reduce redundant data and thus save storage space. One challenge facing traditional compression approaches is the limited size of the compression window, which prevents them from reducing redundancy globally. In this paper, we present DEC, a Deduplication-Enhanced Compression approach that effectively combines deduplication with traditional compressors to increase both compression ratio and efficiency. Specifically, we make full use of deduplication to (1) accelerate data reduction through fast but global deduplication and (2) exploit data locality by clustering data chunks that are adjacent to the same duplicate chunks, so that similar chunks are compressed together. Experimental results from a DEC prototype on real-world datasets show that, compared to traditional compressors, DEC increases the compression ratio by 20% to 71% and speeds up compression throughput by 17% to 183%, without sacrificing decompression throughput.
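The core idea of combining global deduplication with a local compressor can be illustrated with a minimal sketch. This is not the paper's actual DEC pipeline (which uses content-defined chunking and locality-aware clustering of chunks adjacent to shared duplicates); fixed-size chunking, SHA-256 fingerprints, and zlib as the traditional compressor are all simplifying assumptions made here for illustration:

```python
import hashlib
import zlib

def dec_compress(data: bytes, chunk_size: int = 4096):
    """Sketch of deduplication-enhanced compression:
    (1) split the input into chunks, (2) deduplicate globally by
    fingerprint, (3) feed only the unique chunks to a traditional
    compressor. The recipe records how to rebuild the original order."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique, recipe, index = [], [], {}
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in index:            # first occurrence: keep the chunk
            index[fp] = len(unique)
            unique.append(chunk)
        recipe.append(index[fp])       # duplicates cost one index entry
    compressed = zlib.compress(b"".join(unique))
    return compressed, recipe, chunk_size

def dec_decompress(compressed: bytes, recipe, chunk_size: int) -> bytes:
    """Invert the pipeline: decompress the unique chunks, then replay
    the recipe to restore duplicates in their original positions."""
    raw = zlib.decompress(compressed)
    unique = [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]
    return b"".join(unique[i] for i in recipe)
```

Because deduplication removes repeats across the entire input before the compressor runs, redundancy beyond the compressor's window is eliminated cheaply; DEC additionally reorders the remaining chunks so that likely-similar ones sit in the same compression window, which this sketch omits.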
