Compression of correlated bit-vectors

Abstract: Bitmaps are data structures that occur frequently in information retrieval. They are useful, but they are also large and expensive to store. For this reason, considerable effort has been devoted to finding techniques for compressing them. These techniques are most effective for sparse bitmaps. We propose a preprocessing stage in which bitmaps are first clustered, and the clusters are then used to transform their member bitmaps into sparser ones that can be compressed more effectively. The clustering method efficiently generates a graph structure on the bitmaps. In some situations, it is desirable to impose restrictions on the graph; finding the optimal graph satisfying these restrictions is shown to be NP-complete. The results of applying our algorithm to the Bible are presented: for some sets of bitmaps, our method almost doubled the compression savings.
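
The abstract does not spell out the transformation, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that bitmaps are held as Python integers, that similarity is measured by Hamming distance, that the graph is a minimum spanning tree built with Kruskal's method over the bitmaps plus a virtual all-zero root, and that each bitmap is replaced by its XOR with its tree parent. The names `xor_transform` and `restore` are hypothetical.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def xor_transform(bitmaps):
    """Return (parents, residuals) for a list of integer bitmaps.

    Assumption: node n is a virtual all-zero root, so a bitmap attached
    directly to the root is stored unchanged.
    """
    n = len(bitmaps)
    nodes = list(bitmaps) + [0]
    edges = sorted(
        (hamming(nodes[i], nodes[j]), i, j)
        for i, j in combinations(range(n + 1), 2)
    )

    # Kruskal's algorithm with a simple union-find structure.
    uf = list(range(n + 1))

    def find(x):
        while uf[x] != x:
            uf[x] = uf[uf[x]]  # path halving
            x = uf[x]
        return x

    adj = {i: [] for i in range(n + 1)}
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            uf[ri] = rj
            adj[i].append(j)
            adj[j].append(i)

    # Orient the tree away from the root and XOR each node with its parent.
    parents = [None] * n
    residuals = [0] * n
    seen = {n}
    stack = [n]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                parents[v] = u
                residuals[v] = nodes[v] ^ nodes[u]
                stack.append(v)
    return parents, residuals


def restore(parents, residuals):
    """Invert xor_transform by XOR-ing each residual with its restored parent."""
    n = len(residuals)
    out = [None] * n

    def get(i):
        if i == n:  # virtual all-zero root
            return 0
        if out[i] is None:
            out[i] = residuals[i] ^ get(parents[i])
        return out[i]

    return [get(i) for i in range(n)]
```

Under these assumptions, the total number of 1-bits in the residuals equals the weight of the spanning tree, which can never exceed the total number of 1-bits in the original bitmaps, because attaching every bitmap directly to the all-zero root is itself a spanning tree. The residuals would then be handed to any sparse-bitmap compressor, and `restore` recovers the originals after decompression.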
