Engineering the compression of massive tables: an experimental approach

We study the problem of compressing massive tables. We devise a novel compression paradigm, training for lossless compression, which assumes that the data exhibit dependencies that can be learned by examining a small amount of training material. We develop an experimental methodology to test the approach. Our result is a system, pzip, which outperforms gzip by a factor of two in compressed size and in both compression and decompression time on various tabular data. Pzip is now in production use in an AT&T network traffic data warehouse.
