Fast searching on compressed text allowing errors

We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed text directly. The compression scheme uses a word-based Huffman encoding, and the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression times are close to those of Compress and approximately half those of Gzip, and decompression times are lower than those of Gzip and one third of those of Compress. The searching algorithm allows a large number of variations of the exact and approximate compressed string matching problem, such as phrases, ranges, complements, wild cards and arbitrary regular expressions. Separators and stopwords can be discarded at search time without significantly increasing the cost. The algorithm is based on a word-oriented shift-or algorithm and a fast Boyer-Moore-type filter. It also uses the vocabulary of the text, which is available as part of the Huffman coding data. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is twice as fast as running Agrep on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithm is up to 8 times faster than Agrep. We also mention the impact of our technique on inverted files pointing to documents or logical blocks, such as Glimpse.
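The word-oriented shift-or idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text is already available as a list of words rather than as byte-oriented Huffman codewords, and the names `build_masks` and `shift_or_words` are hypothetical. Each phrase position is a predicate over vocabulary words, which is one way to model exact words, prefixes, ranges or other per-word conditions. The vocabulary is scanned once to build a bit mask per word, and the text is then scanned with one shift and one OR per word.

```python
def build_masks(pattern, vocabulary):
    """Precompute one bit mask per vocabulary word.

    pattern: list of predicates (hypothetical representation), one per
    phrase position. Bit i of a word's mask is 0 when the word is
    accepted at phrase position i, and 1 otherwise (shift-or convention).
    """
    m = len(pattern)
    all_ones = (1 << m) - 1
    masks = {}
    for word in vocabulary:
        mask = all_ones
        for i, accepts in enumerate(pattern):
            if accepts(word):
                mask &= ~(1 << i)  # clear bit i: word fits position i
        masks[word] = mask
    return masks, all_ones


def shift_or_words(words, pattern):
    """Return the start positions (word indices) of every phrase match."""
    m = len(pattern)
    masks, all_ones = build_masks(pattern, set(words))
    state = all_ones  # all bits 1: no prefix of the phrase matched yet
    hits = []
    for pos, word in enumerate(words):
        # Shift in a 0 (a new match may start here), then OR in the
        # mask so that only prefixes consistent with this word survive.
        state = ((state << 1) | masks[word]) & all_ones
        if state & (1 << (m - 1)) == 0:  # highest bit 0: full phrase matched
            hits.append(pos - m + 1)
    return hits


text = "the quick brown fox saw the quiet brown dog".split()
phrase = [
    lambda w: w == "the",
    lambda w: w.startswith("qui"),  # a prefix condition on the second word
    lambda w: w == "brown",
]
matches = shift_or_words(text, phrase)
```

Here `matches` holds the word indices where the three-word phrase begins. In a compressed setting, the masks would presumably be attached to vocabulary entries and looked up as each codeword is decoded, so the per-word matching cost is paid once per distinct word rather than once per occurrence.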
