Boyer-Moore String Matching over Ziv-Lempel Compressed Text

We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The key idea is that, although we cannot choose exactly which text characters to inspect, we can still use the characters explicitly represented in those formats to shift the pattern in the text. We present a basic approach and more advanced ones. Although the theoretical average complexity does not improve, because every symbol in the compressed text still has to be scanned, we show experimentally that speedups of up to 30% over the fastest previous approaches are obtained. Moreover, we show that, with an encoding method that sacrifices some compression ratio, our method is twice as fast as decompressing and then searching with the best available algorithms.
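To make the notion of "explicitly represented" characters concrete, the following minimal sketch parses a text into LZ78 blocks of the form (reference, character) and then recovers the text position of each block's final character from the block lengths alone, without decompressing. The function names are hypothetical and chosen only for illustration. The shift table is the standard Horspool rule, shown as one possible way such a character could drive a shift; it is an assumption, not necessarily the shift function used in this work.

```python
def lz78_compress(text):
    """Parse text into LZ78 blocks (ref, char); ref 0 denotes the empty phrase."""
    dictionary = {"": 0}
    blocks = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the current phrase
        else:
            blocks.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Leftover phrase is already in the dictionary; LZ78 dictionaries are
        # prefix-closed, so its prefix has a valid reference.
        blocks.append((dictionary[phrase[:-1]], phrase[-1]))
    return blocks


def explicit_characters(blocks):
    """Return (text_position, char) for the final character of every block.

    Only block lengths are tracked; no phrase is ever expanded.
    """
    lengths = [0]          # lengths[i] = length of phrase i (phrase 0 is empty)
    end = 0                # number of text characters covered so far
    result = []
    for ref, ch in blocks:
        length = lengths[ref] + 1
        lengths.append(length)
        end += length
        result.append((end - 1, ch))
    return result


def horspool_shifts(pattern):
    """Standard Horspool table: shift for a text character aligned with the
    last pattern position. Characters absent from the table shift by len(pattern)."""
    m = len(pattern)
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}


if __name__ == "__main__":
    text = "abracadabra abracadabra"
    blocks = lz78_compress(text)
    for pos, ch in explicit_characters(blocks):
        print(pos, repr(ch))
    print(horspool_shifts("abra"))
```

In this sketch the only characters available for shifting are those at block ends, which is why the positions to inspect cannot be chosen freely: they are dictated by how the compressor happened to parse the text.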
