Parameterised compression for sparse bitmaps

Full-text retrieval systems often use either a bitmap or an inverted file to identify which documents contain which terms, so that the documents containing any combination of query terms can be quickly located. Bitmaps of term occurrences are large, but are usually sparse, and thus are amenable to a variety of compression techniques. Here we consider techniques in which the encoding of each bitvector within the bitmap is parameterised, so that a different code can be used for each bitvector. Our experimental results show that the new methods yield better compression than previous techniques.
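To illustrate what a per-bitvector parameter can look like, the sketch below codes the gaps between successive 1-bits of one bitvector with a Golomb code whose parameter b is chosen from that bitvector's density. This is an assumption made for illustration only: the abstract does not say which codes the paper uses, and the function names, the 1-based document numbering, and the rule of thumb b ≈ 0.69 × (length / number of 1-bits) are choices made here, not taken from the paper.

```python
def golomb_encode(n, b):
    """Golomb codeword for integer n >= 1 with parameter b >= 1, as a '0'/'1' string."""
    q, r = divmod(n - 1, b)
    bits = "1" * q + "0"                 # quotient in unary
    if b > 1:                            # remainder in truncated binary
        k = (b - 1).bit_length()         # ceil(log2(b))
        cutoff = (1 << k) - b            # remainders below cutoff use k-1 bits
        if r < cutoff:
            bits += format(r, "b").zfill(k - 1)
        else:
            bits += format(r + cutoff, "b").zfill(k)
    return bits


def golomb_decode(bits, pos, b):
    """Decode one codeword starting at bits[pos]; return (value, next position)."""
    q = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                             # skip the terminating '0' of the unary part
    r = 0
    if b > 1:
        k = (b - 1).bit_length()
        cutoff = (1 << k) - b
        r = int(bits[pos:pos + k - 1], 2) if k > 1 else 0
        pos += k - 1
        if r >= cutoff:                  # long codeword: read one more bit
            r = ((r << 1) | int(bits[pos])) - cutoff
            pos += 1
    return q * b + r + 1, pos


def encode_bitvector(positions, length):
    """Encode the sorted 1-based positions of the 1-bits in a bitvector of `length` bits.

    The parameter b is picked per bitvector (assumed rule of thumb, see lead-in).
    """
    b = max(1, round(0.69 * length / len(positions)))
    out, prev = [], 0
    for p in positions:
        out.append(golomb_encode(p - prev, b))   # code the gap, not the position
        prev = p
    return b, "".join(out)


def decode_bitvector(b, bits):
    """Recover the 1-bit positions from the parameter b and the coded gaps."""
    positions, pos, prev = [], 0, 0
    while pos < len(bits):
        gap, pos = golomb_decode(bits, pos, b)
        prev += gap
        positions.append(prev)
    return positions


if __name__ == "__main__":
    b, code = encode_bitvector([3, 8, 9, 40, 100], 128)
    print(b, len(code), decode_bitvector(b, code))
```

A sparse bitvector gets a large b and spends few bits on long gaps, while a dense one gets a small b; storing b alongside each bitvector is the per-vector cost of this kind of parameterisation.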
