Fast and flexible word searching on compressed text

We present a fast compression technique for natural language texts. The novelties are that (1) arbitrary portions of the text can be decompressed very efficiently, (2) exact search for words and phrases can be done directly on the compressed text, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code whose coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half that of Gzip, and decompression time is lower than that of Gzip and one third that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations on the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on the compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithms are up to 8 times faster than searching the uncompressed text. We also discuss the impact of our technique on inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for display purposes.
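
To make the idea of a semistatic, word-based, byte-oriented Huffman code concrete, the following is a minimal Python sketch, not the authors' implementation. It makes several assumptions that the abstract does not state: words and separators are both treated as symbols, each codeword is a sequence of 7-bit digits, and the high bit of the first byte of every codeword is set as a marker so that a plain byte search cannot match in the middle of a codeword. All function names (`tokenize`, `build_code`, `compress`, `search`, `decompress`) are hypothetical.

```python
import heapq
import re
from collections import Counter
from itertools import count


def tokenize(text):
    # Assumption: words and separators alternate and both are symbols.
    return re.findall(r"\w+|\W+", text)


def build_code(freqs, arity=128):
    """d-ary Huffman code: map each symbol to a tuple of digits in range(arity)."""
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(f, next(tie), ("leaf", sym)) for sym, f in freqs.items()]
    # Pad with zero-frequency dummies so every merge takes exactly `arity` nodes.
    while len(heap) > 1 and (len(heap) - 1) % (arity - 1) != 0:
        heap.append((0, next(tie), ("leaf", None)))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(arity)]
        total = sum(f for f, _, _ in group)
        children = [node for _, _, node in group]
        heapq.heappush(heap, (total, next(tie), ("node", children)))
    codes = {}
    stack = [(heap[0][2], ())]
    while stack:
        (kind, payload), prefix = stack.pop()
        if kind == "leaf":
            if payload is not None:  # skip dummy leaves
                codes[payload] = prefix or (0,)  # single-symbol vocabulary
        else:
            for digit, child in enumerate(payload):
                stack.append((child, prefix + (digit,)))
    return codes


def compress(text, arity=128):
    """Two passes (semistatic model): count symbols, then encode them."""
    assert 2 <= arity <= 128  # digits must fit in 7 bits
    tokens = tokenize(text)
    codes = build_code(Counter(tokens), arity)
    # Set the high bit on the first byte of each codeword (assumed marker).
    codebook = {s: bytes([d[0] | 0x80, *d[1:]]) for s, d in codes.items()}
    return b"".join(codebook[t] for t in tokens), codebook


def search(data, codebook, word):
    """Byte offsets of `word` in the compressed data, found without decoding."""
    pattern = codebook.get(word)
    if pattern is None:
        return []
    hits, pos = [], data.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = data.find(pattern, pos + 1)
    return hits


def decompress(data, codebook, start=0):
    """Decode from an arbitrary offset, resynchronising on the next marked byte."""
    inverse = {code: sym for sym, code in codebook.items()}
    while start < len(data) and not data[start] & 0x80:
        start += 1
    out, i = [], start
    while i < len(data):
        j = i + 1
        while j < len(data) and not data[j] & 0x80:
            j += 1
        out.append(inverse[data[i:j]])
        i = j
    return "".join(out)
```

In this sketch, searching for a word amounts to looking up its codeword and running an ordinary byte-string search (`bytes.find` here, though any sequential matcher would do) over the compressed data. Because the pattern begins with a marked byte and the code is prefix-free, every hit corresponds to a whole codeword. The same marker lets `decompress` start at any offset, which illustrates how decoding an arbitrary portion of the text could work under these assumptions.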
