A Systematic Approach to Compressing a Full-Text Retrieval System

Abstract: This article reports on a variety of compression algorithms developed for a project to put all the data files of a full-text retrieval system on CD-ROM. In the context of inexpensive pre-processing, a text-compression algorithm is presented that is based on Markov-modeled Huffman coding on an extended alphabet. Data structures that facilitate random access into the compressed text are examined. In addition, new algorithms are presented for compressing word indices, both the dictionaries (word lists) and the text pointers (concordances). The ARTFL database is used as a test case throughout the article.
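To make the phrase "Markov-modeled Huffman coding" concrete, the following is a minimal sketch and not the article's own algorithm. It assumes a first-order model (the context is the single preceding symbol) and uses plain characters as symbols rather than the extended alphabet the abstract mentions. The function names and the empty-string start context are hypothetical choices made for illustration.

```python
import heapq
from collections import Counter, defaultdict


def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from {symbol: count}."""
    if len(freqs) == 1:
        # A lone symbol still gets one bit so the decoder can count symbols.
        return {next(iter(freqs)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def build_model(text):
    """One Huffman code per context; the context is the previous symbol."""
    counts = defaultdict(Counter)
    prev = ""  # hypothetical start-of-text context (an assumption of this sketch)
    for ch in text:
        counts[prev][ch] += 1
        prev = ch
    return {ctx: huffman_code(c) for ctx, c in counts.items()}


def encode(text, model):
    """Encode each symbol with the code table selected by its predecessor."""
    bits, prev = [], ""
    for ch in text:
        bits.append(model[prev][ch])
        prev = ch
    return "".join(bits)


def decode(bits, model):
    """Invert encode(): follow the same context chain while reading bits."""
    decoders = {ctx: {b: s for s, b in code.items()} for ctx, code in model.items()}
    out, prev, buf = [], "", ""
    for bit in bits:
        buf += bit
        sym = decoders[prev].get(buf)
        if sym is not None:  # prefix-free, so the first match is the symbol
            out.append(sym)
            prev, buf = sym, ""
    return "".join(out)


if __name__ == "__main__":
    sample = "abracadabra abracadabra"
    model = build_model(sample)
    packed = encode(sample, model)
    print(len(packed), "bits versus", 8 * len(sample), "bits uncompressed")
    print(decode(packed, model) == sample)
```

The sketch ignores the cost of storing one code table per context. In a pre-processed, read-only setting such as a CD-ROM, those tables would be built once and stored alongside the compressed text.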
