Migratory compression: coarse-grained data reordering to improve compressibility

We propose Migratory Compression (MC), a coarse-grained data transformation that improves the effectiveness of traditional compressors in modern storage systems. In MC, similar data chunks are relocated so that they sit together, which improves compression factors. After decompression, migrated chunks are returned to their original locations. We evaluate the compression effectiveness and overhead of MC, explore reorganization approaches on a variety of datasets, and present a prototype implementation of MC in a commercial deduplicating file system. We also compare MC to the more established technique of delta compression, which is significantly more complex to implement within file systems. We find that MC improves compression effectiveness over traditional compressors by 11% to 105%, with relatively low impact on run-time performance. Frequently, adding MC to a relatively fast compressor such as gzip yields compression that is more effective in both space and run time than slower alternatives. In archival migration, MC improves gzip compression by 44% to 157%. Most importantly, MC can be implemented in broadly used, modern file systems.
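The reorder, compress, and restore cycle described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the fixed 4 KiB chunk size, the max-CRC32 similarity feature, the recipe layout, and the names `split_chunks`, `feature`, `mc_compress`, and `mc_decompress` are all assumptions chosen for illustration, and zlib stands in for gzip-style compression.

```python
import zlib

CHUNK_SIZE = 4096  # assumed chunk size, not taken from the paper
WINDOW = 8         # assumed sliding-window width for the similarity feature


def split_chunks(data, size=CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def feature(chunk, window=WINDOW):
    """Crude similarity feature: the maximum CRC32 over all sliding windows.

    Chunks that share most of their content often share this maximum,
    so sorting by it tends to place similar chunks next to each other.
    This is a hypothetical stand-in for a real similarity detector.
    """
    windows = (chunk[i:i + window] for i in range(len(chunk) - window + 1))
    return max((zlib.crc32(w) for w in windows), default=zlib.crc32(chunk))


def mc_compress(data):
    """Migrate similar chunks together, then compress the reordered bytes.

    Returns (recipe, blob); recipe lists (original_index, length) for each
    chunk in migrated order and is needed to undo the migration.
    """
    chunks = split_chunks(data)
    order = sorted(range(len(chunks)), key=lambda i: (feature(chunks[i]), i))
    recipe = [(i, len(chunks[i])) for i in order]
    migrated = b"".join(chunks[i] for i in order)
    return recipe, zlib.compress(migrated)


def mc_decompress(recipe, blob):
    """Decompress, then return every chunk to its original position."""
    migrated = zlib.decompress(blob)
    restored = [b""] * len(recipe)
    offset = 0
    for original_index, length in recipe:
        restored[original_index] = migrated[offset:offset + length]
        offset += length
    return b"".join(restored)
```

A round trip such as `mc_decompress(*mc_compress(data)) == data` should hold for any byte string. Whether the reordering actually helps depends on the data and on how far apart similar chunks lie relative to the compressor's window; this sketch makes no claim about the gains reported above.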
