A general-purpose compression scheme for large collections

Compressing large collections can improve retrieval times, because the CPU cost of decompression is offset by the reduced cost of seeking and retrieving data from disk. We propose xray, a semistatic phrase-based approach that builds a model offline from sample training data extracted from a collection and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections: in general, it is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically slower only than gzip in decompression. Moreover, the query evaluation cost of retrieving documents from a large collection with our search engine is improved by more than 30% when xray is incorporated, compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed, and show that the entire collection can be compressed effectively using just over 4% of the data for training. We also propose four schemes for phrase-match selection during the single-pass compression of the collection. We conclude that, with these novel approaches, xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
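
The general shape of a semistatic phrase-based scheme can be illustrated with a minimal Python sketch. This is not the xray algorithm: the abstract does not specify its model construction, its four phrase-match selection schemes, or its coding step. The function names (`train_model`, `compress_record`, `decompress_record`), the n-gram scoring heuristic, and the greedy longest-match rule are all assumptions chosen for illustration, and the output is a plain list of phrase identifiers rather than an entropy-coded bit stream.

```python
from collections import Counter


def train_model(sample: bytes, max_phrases: int = 64,
                min_len: int = 2, max_len: int = 8) -> list[bytes]:
    """Build a fixed phrase list offline from a sample (hypothetical heuristic)."""
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(sample) - n + 1):
            counts[sample[i:i + n]] += 1
    # Rank repeated n-grams by a rough "bytes saved" score; ties broken by content.
    ranked = sorted(
        ((p, c) for p, c in counts.items() if c > 1),
        key=lambda pc: ((pc[1] - 1) * (len(pc[0]) - 1), pc[0]),
        reverse=True,
    )
    phrases = [p for p, _ in ranked[:max_phrases]]
    # Include every single byte so that unseen data can always be encoded.
    return [bytes([b]) for b in range(256)] + phrases


def compress_record(record: bytes, model: list[bytes]) -> list[int]:
    """Encode one record in a single pass using greedy longest match (assumption)."""
    index = {phrase: i for i, phrase in enumerate(model)}
    longest = max(len(phrase) for phrase in model)
    codes, pos = [], 0
    while pos < len(record):
        # Try the longest candidate first; length 1 always matches.
        for n in range(min(longest, len(record) - pos), 0, -1):
            code = index.get(record[pos:pos + n])
            if code is not None:
                codes.append(code)
                pos += n
                break
    return codes


def decompress_record(codes: list[int], model: list[bytes]) -> bytes:
    """Decode one record independently: one table lookup per identifier."""
    return b"".join(model[c] for c in codes)


# Train once on a small sample, then compress records independently.
sample = b"the cat sat on the mat. the cat ate the rat. " * 4
model = train_model(sample)
records = [b"the cat sat", b"a new record: the bat"]
encoded = [compress_record(r, model) for r in records]
decoded = [decompress_record(e, model) for e in encoded]
```

Because the model is fixed after training, each record is encoded and decoded on its own, and a record added later (such as the second one above) is handled with the same model and no retraining. Decoding is a sequence of table lookups, which is one reason schemes of this general kind can decompress quickly.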
