General-purpose compression for efficient retrieval

Compression of databases not only reduces space requirements but can also reduce overall retrieval times. In text databases, compression of documents based on semistatic modeling with words has been shown to be both practical and fast. Similarly, for specific applications, such as databases of integers or scientific databases, specially designed semistatic compression schemes work well. We propose a scheme for general-purpose compression that can be applied to all types of data stored in large collections. We describe our approach, which we call RAY, in detail, and report experimentally the compression it achieves, its compression and decompression costs, and its performance as both a stream and a random-access technique. We show that, in many cases, RAY achieves better compression than an efficient Huffman scheme and popular adaptive compression techniques, and that it can be used as an efficient general-purpose compression scheme.
