Improving table compression with combinatorial optimization

We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [SODA'00], in which an off-line training procedure partitions a table into disjoint intervals of columns, each of which is compressed separately by a standard on-line compressor such as gzip. We provide a new theory that unifies previous experimental observations on partitioning and heuristic observations on column permutation, all of which are used to improve compression rates. Based on this theory, we devise the first on-line training algorithms for table compression, which can be applied to individual files, not just continuously operating sources. We also devise a new off-line training algorithm, based on a link to the asymmetric traveling salesman problem, which improves on prior work by rearranging columns prior to partitioning. We demonstrate these results experimentally. On various test files, the on-line algorithms provide 35-55% improvement over gzip with negligible slowdown; the off-line reordering provides up to 20% further improvement over partitioning alone. We also show that a variation of the table compression problem is MAX-SNP hard.
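The partitioning idea described in the abstract can be illustrated with a minimal sketch. It assumes a table stored as fixed-width byte rows, uses Python's zlib (the DEFLATE method underlying gzip) as the stand-in compressor, and chooses the column intervals with a straightforward dynamic program over interval end points. The names `interval_cost` and `best_partition` are hypothetical, and this is not the authors' implementation; it covers only the partitioning step, not column reordering or on-line training.

```python
import zlib


def interval_cost(rows, start, end):
    """Compressed size of columns [start, end) taken from every row."""
    projection = b"".join(row[start:end] for row in rows)
    return len(zlib.compress(projection, 9))


def best_partition(rows):
    """Pick contiguous column intervals minimizing total compressed size.

    Assumes every row has the same width. Returns (total_size, intervals),
    where intervals is a list of half-open (start, end) column ranges.
    """
    n = len(rows[0])
    best = [0] + [None] * n   # best[j]: minimum cost for columns [0, j)
    cut = [0] * (n + 1)       # cut[j]: start of the last interval in that solution
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + interval_cost(rows, i, j)
            if best[j] is None or cost < best[j]:
                best[j], cut[j] = cost, i
    # Walk the cut points backwards to recover the intervals.
    intervals = []
    j = n
    while j > 0:
        intervals.append((cut[j], j))
        j = cut[j]
    return best[n], intervals[::-1]


if __name__ == "__main__":
    # Toy table: 200 rows of 8 bytes with two unrelated column groups.
    rows = [b"%04d" % (k % 7) + bytes([65 + k % 3]) * 4 for k in range(200)]
    total, intervals = best_partition(rows)
    print(total, intervals)
```

Because the single interval covering all columns is one of the candidates the dynamic program considers, the returned total is never larger than compressing the whole table in one piece with the same compressor. The sketch compresses every candidate interval, so it performs a quadratic number of compressions in the number of columns and is only practical on a small training sample.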

[1]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[2]  Kenneth Ward Church,et al.  Engineering the compression of massive tables: an experimental approach , 2000, SODA '00.

[3]  Ian H. Witten,et al.  Protein is incompressible , 1999, Proceedings DCC'99 Data Compression Conference.

[4]  Ronald V. Book,et al.  Review: Michael R. Garey and David S. Johnson, Computers and intractability: A guide to the theory of NP-completeness , 1980 .

[5]  Mihalis Yannakakis,et al.  The Traveling Salesman Problem with Distances One and Two , 1993, Math. Oper. Res..

[6]  Mihalis Yannakakis,et al.  Optimization, approximation, and complexity classes , 1991, STOC '88.

[7]  Stéphane Grumbach,et al.  A New Challenge for Compression Algorithms: Genetic Sequences , 1994, Inf. Process. Manag..

[8]  Flavio Licciulli,et al.  Update of AMmtDB: a database of multi-aligned metazoa mitochondrial DNA sequences , 1999, Nucleic Acids Res..

[9]  Weixiong Zhang,et al.  The Asymmetric Traveling Salesman Problem: Algorithms, Instance Generators, and Tests , 2001, ALENEX.

[10]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[11]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[12]  Giovanni Manzini,et al.  Compression of Low Entropy Strings with Lempel-Ziv Algorithms , 1999, SIAM J. Comput..

[13]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[14]  Carsten Lund,et al.  Proof verification and the hardness of approximation problems , 1998, JACM.

[15]  Tao Jiang,et al.  Linear approximation of shortest superstrings , 1991, STOC '91.

[16]  Terry A. Welch,et al.  A Technique for High-Performance Data Compression , 1984, Computer.

[17]  Gordon V. Cormack,et al.  Data compression on a database system , 1985, CACM.

[18]  Richard M. Karp,et al.  The traveling-salesman problem and minimum spanning trees: Part II , 1971, Math. Program..

[19]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[20]  Carsten Lund,et al.  Proof verification and hardness of approximation problems , 1992, Proceedings., 33rd Annual Symposium on Foundations of Computer Science.

[21]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[22]  Madhu Sudan,et al.  Proof verification and the hardness of approximation problems , 1998 .

[23]  Richard M. Karp,et al.  The Traveling-Salesman Problem and Minimum Spanning Trees , 1970, Oper. Res..