Compression-Domain Text Indexing and Retrieval

Keyword-based text retrieval engines have been, and will continue to be, essential to text-based information access systems because they serve as the basic building blocks for higher-level text analysis systems. Traditionally, text compression and text retrieval are treated as independent problems: text files are compressed and indexed separately. To answer a keyword-based query, text files are first uncompressed and then searched sequentially or via an inverted index. This paper describes the design, implementation, and evaluation of a novel integrated text compression and indexing scheme called ITCI, which combines the dictionary data structures for compression and indexing and allows direct search through compressed text. The performance results show that ITCI's compression efficiency is within 7% to 17% of GZIP, which is among the best lossless data compression algorithms. The sum of the compressed text and the inverted index is only between 55% and 76% of the original text size, while the search performance is comparable to full inverted indexing with a dynamic index cache.
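To make the idea of a shared dictionary concrete, the following is a minimal Python sketch, not ITCI's actual design, which this abstract does not specify. It assumes a word-level dictionary in which each distinct token receives an integer code; the same dictionary drives both the encoded text and the inverted index, so a keyword query is resolved against codes without decoding any document. The class name `SharedDictionaryIndex` and all of its methods are hypothetical, and the integer lists stand in for a real bit-level encoding, so the sketch shows the lookup path rather than any actual space savings.

```python
import re
from collections import defaultdict


class SharedDictionaryIndex:
    """Toy scheme (an assumption, not ITCI itself): one dictionary
    serves both text encoding and inverted indexing."""

    def __init__(self):
        self.word_to_id = {}              # shared dictionary: token -> integer code
        self.id_to_word = []              # inverse mapping, used only for decoding
        self.postings = defaultdict(set)  # code -> ids of documents containing it
        self.docs = {}                    # doc id -> list of codes (encoded text)

    def _code(self, token):
        # Assign the next free integer code to a token seen for the first time.
        if token not in self.word_to_id:
            self.word_to_id[token] = len(self.id_to_word)
            self.id_to_word.append(token)
        return self.word_to_id[token]

    def add(self, doc_id, text):
        # Split into alternating word and non-word runs so decoding is lossless.
        tokens = re.findall(r"\w+|\W+", text)
        codes = [self._code(t) for t in tokens]
        self.docs[doc_id] = codes
        for code in set(codes):
            self.postings[code].add(doc_id)

    def search(self, keyword):
        # Translate the keyword to its code once, then match codes directly
        # in the encoded documents; no document is decoded.
        code = self.word_to_id.get(keyword)
        if code is None:
            return {}
        return {
            doc_id: [i for i, c in enumerate(self.docs[doc_id]) if c == code]
            for doc_id in sorted(self.postings[code])
        }

    def decompress(self, doc_id):
        # Rebuild the original text from the code sequence.
        return "".join(self.id_to_word[c] for c in self.docs[doc_id])


idx = SharedDictionaryIndex()
idx.add(1, "text compression and text retrieval")
idx.add(2, "inverted index")
print(idx.search("text"))
print(idx.decompress(1))
```

In this sketch the query cost depends on the dictionary lookup and the postings for one code, which is the property the abstract attributes to searching compressed text directly; a practical scheme would also replace the integer lists with a compact variable-length encoding.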
