Construction of optimal graphs for bit-vector compression

Bitmaps are data structures occurring often in information retrieval. They are useful; they are also large and expensive to store. For this reason, considerable effort has been devoted to finding techniques for compressing them. These techniques are most effective for sparse bitmaps. We propose a preprocessing stage in which bitmaps are first clustered and the clusters are used to transform their member bitmaps into sparser ones that can be compressed more effectively. The clustering method efficiently generates a graph structure on the bitmaps. The results of applying our algorithm to the Bible are presented: for some sets of bitmaps, our method almost doubled the compression savings.
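
To make the preprocessing idea concrete, the following is a minimal sketch and not necessarily the construction used in the paper. It assumes that the graph is a minimum spanning tree under Hamming distance, grown from a virtual all-zero root, and that each bitmap is replaced by its XOR with its parent in that tree. The abstract states neither assumption. The function names (`build_tree`, `transform`, `restore`) and the sample bitmaps are hypothetical.

```python
from typing import List


def popcount(x: int) -> int:
    """Number of 1-bits in x."""
    return bin(x).count("1")


def build_tree(bitmaps: List[int]) -> List[int]:
    """Prim's algorithm over the bitmaps plus a virtual all-zero root.

    Returns parent[i] for each bitmap; -1 denotes the virtual root,
    meaning the bitmap is stored unchanged. (Assumed construction.)
    """
    n = len(bitmaps)
    # Linking to the root costs the bitmap's own number of 1-bits.
    best_cost = [popcount(b) for b in bitmaps]
    parent = [-1] * n
    in_tree = [False] * n
    for _ in range(n):
        # Add the cheapest bitmap that is not yet in the tree.
        i = min((k for k in range(n) if not in_tree[k]),
                key=lambda k: best_cost[k])
        in_tree[i] = True
        # Linking to bitmap i may now be cheaper for the remaining ones.
        for j in range(n):
            if not in_tree[j]:
                d = popcount(bitmaps[i] ^ bitmaps[j])
                if d < best_cost[j]:
                    best_cost[j] = d
                    parent[j] = i
    return parent


def transform(bitmaps: List[int], parent: List[int]) -> List[int]:
    """Replace each bitmap by its XOR with its parent (sparser if similar)."""
    return [b if p < 0 else b ^ bitmaps[p] for b, p in zip(bitmaps, parent)]


def restore(deltas: List[int], parent: List[int]) -> List[int]:
    """Invert transform() by XOR-ing along the path to the root."""
    out = [None] * len(deltas)

    def get(i: int) -> int:
        if out[i] is None:
            out[i] = deltas[i] if parent[i] < 0 else deltas[i] ^ get(parent[i])
        return out[i]

    return [get(i) for i in range(len(deltas))]


# Hypothetical sample data: three similar bitmaps and one unrelated bitmap.
bitmaps = [0b11110000, 0b11110001, 0b11100001, 0b00001111]
parent = build_tree(bitmaps)
deltas = transform(bitmaps, parent)
ones_before = sum(popcount(b) for b in bitmaps)
ones_after = sum(popcount(d) for d in deltas)
```

In this toy input the three similar bitmaps form a chain and the unrelated one attaches to the root, so the total number of 1-bits drops from `ones_before` to `ones_after` while `restore(deltas, parent)` recovers the originals. The sparser `deltas` would then be passed to whichever sparse-bitmap compressor is in use.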
