Exploiting clustering in inverted file compression

Document databases contain large volumes of text, with typical sizes now reaching into the gigabyte range. To query these text collections efficiently, some form of index is required, since without an index even the fastest pattern matching techniques result in unacceptable response times. One pervasive indexing method is the inverted file, also sometimes known as a concordance or postings file. A number of efforts have been made to capture the "clustering" effect, and to design index compression methods that condition their probability predictions according to context. In these methods, information as to whether or not the most recent (or second most recent, and so on) document contained term t is used to bias the prediction that the next document will contain term t. We further extend this notion of context-based index compression, and describe a surprisingly simple index representation that gives excellent performance on all of our test databases, allows fast decoding, and is, even in the worst case, only slightly inferior to Golomb (1966) coding.
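For orientation, the Golomb coding used as the baseline above can be sketched as follows. This is not the context-based representation proposed here, which the abstract does not specify. It is only a minimal illustration, assuming that each term's postings list is stored as the gaps between successive document numbers and that the parameter b is chosen by the caller. All function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Turn a strictly increasing list of document numbers (>= 1) into gaps."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def _to_bits(value, width):
    """Fixed-width binary string; empty when width is 0."""
    return f"{value:0{width}b}" if width > 0 else ""


def golomb_encode(gaps, b):
    """Golomb-code each gap (>= 1) with parameter b (>= 1) as a bit string."""
    k = (b - 1).bit_length()      # ceil(log2(b))
    u = (1 << k) - b              # number of short (k-1 bit) remainder codes
    out = []
    for g in gaps:
        q, r = divmod(g - 1, b)
        out.append("1" * q + "0")             # quotient in unary
        if r < u:
            out.append(_to_bits(r, k - 1))    # short truncated-binary code
        else:
            out.append(_to_bits(r + u, k))    # long truncated-binary code
    return "".join(out)


def golomb_decode(bits, b):
    """Decode a bit string produced by golomb_encode with the same b."""
    k = (b - 1).bit_length()
    u = (1 << k) - b
    gaps, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":     # unary quotient
            q += 1
            i += 1
        i += 1                    # skip the terminating 0
        r = 0
        if k > 0:
            if k > 1:
                r = int(bits[i:i + k - 1], 2)
                i += k - 1
            if r >= u:            # long code: one more bit follows
                r = ((r << 1) | int(bits[i])) - u
                i += 1
        gaps.append(q * b + r + 1)
    return gaps
```

A clustered list such as documents 3, 4, 5, 6, 20, 21 becomes the gaps 3, 1, 1, 1, 14, 1. A single fixed b treats every gap alike, which is the behaviour that context-conditioned methods aim to improve on.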

[1]  Shmuel Tomi Klein, et al. Model based concordance compression, 1992, Data Compression Conference.

[2]  David C. van Voorhis, et al. Optimal source codes for geometrically distributed integer alphabets (Corresp.), 1975, IEEE Trans. Inf. Theory.

[3]  S. Golomb. Run-length encodings, 1966.

[4]  Aviezri S. Fraenkel, et al. Novel Compression of Sparse Bit-Strings — Preliminary Report, 1985.

[5]  Alistair Moffat, et al. Parameterised compression for sparse bitmaps, 1992, SIGIR '92.

[6]  Solomon W. Golomb, et al. Run-length encodings (Corresp.), 1966, IEEE Trans. Inf. Theory.

[7]  Alistair Moffat, et al. Adding compression to a full‐text retrieval system, 1995, Softw. Pract. Exp.

[8]  Shmuel Tomi Klein, et al. Compression of concordances in full-text retrieval systems, 1988, SIGIR '88.

[9]  Shmuel T. Klein, et al. Modeling word occurrences for the compression of concordances, 1995, Proceedings DCC '95 Data Compression Conference.

[10]  P.G. Howard, et al. Fast and efficient lossless image compression, 1993, Proceedings DCC '93: Data Compression Conference.

[11]  T. Raita, et al. Markov models for clusters in concordance compression, 1994, Proceedings of IEEE Data Compression Conference (DCC'94).

[12]  Jukka Teuhola, et al. A Compression Method for Clustered Bit-Vectors, 1978, Inf. Process. Lett.

[13]  Ian H. Witten, et al. Data compression in full-text retrieval systems, 1993.