A compression scheme for large databases

Compressing a database not only reduces its space requirements but can also reduce overall retrieval times. We have described elsewhere our RAY algorithm for compressing databases that contain general-purpose data such as images, sound, and text. Here we describe an extension to RAY that permits its use on very large databases. In this approach, we build a model from a small training set and then use that model to compress the large database. Our preliminary implementation is slow at compression, but its decompression is only slightly slower than that of the popular GZIP scheme. Importantly, we show that the compression effectiveness of our approach is excellent, and markedly better than that of the GZIP and COMPRESS algorithms on our test sets.
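The idea of training a model on a small sample and reusing it across a large database can be illustrated with a minimal sketch. This is not the RAY algorithm: it substitutes zlib's preset-dictionary feature (the `zdict` parameter in Python's standard library) as a stand-in model, and the function names `build_model`, `compress_record`, and `decompress_record` are hypothetical, chosen only for illustration.

```python
import zlib

# Assumption: the "model" here is just a zlib preset dictionary built from
# sample records. RAY builds its model differently; this is only an analogy.
def build_model(training_records, max_size=32768):
    """Concatenate a small training sample into a preset dictionary."""
    blob = b"".join(training_records)
    # Deflate can only reference the last 32 KiB of a preset dictionary.
    return blob[-max_size:]

def compress_record(record, model):
    """Compress one record independently, using the shared model."""
    compressor = zlib.compressobj(level=9, zdict=model)
    return compressor.compress(record) + compressor.flush()

def decompress_record(data, model):
    """Decompress one record; the same model must be supplied."""
    decompressor = zlib.decompressobj(zdict=model)
    return decompressor.decompress(data) + decompressor.flush()

# Hypothetical training sample and record, for illustration only.
training = [
    b"GET /index.html HTTP/1.1 host=example.org status=200\n",
    b"GET /contact.html HTTP/1.1 host=example.org status=404\n",
]
model = build_model(training)
record = b"GET /about.html HTTP/1.1 host=example.org status=200\n"
packed = compress_record(record, model)
restored = decompress_record(packed, model)
```

Because each record is compressed independently against a fixed model, any single record can be retrieved and decompressed without touching its neighbours, which is the property that matters for database retrieval.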
