Approximate string matching on Ziv-Lempel compressed text

We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions. On LZ78/LZW we need O(mkn + R) time in the worst case and O(k^2 n + mk min(n, (mσ)^k) + R) on average, where σ is the alphabet size. The experimental results show a practical speedup over the basic approach of up to 2X for moderate m and small k. We extend the algorithms to more general compression formats and approximate matching models.
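To make the problem statement concrete, the following is a minimal sketch of the basic approach the abstract compares against, taken here to mean decompressing the text and then searching it. It is not the algorithm of the paper, which works on the compressed blocks directly. The function names `lz78_decode` and `approx_search`, and the representation of the LZ78 stream as a list of (block index, character) pairs, are assumptions made for illustration. The search uses the classical dynamic-programming column update, which takes O(mu) time on the uncompressed text.

```python
def lz78_decode(pairs):
    """Decode an LZ78 stream given as (block_index, char) pairs.

    Block 0 is the empty string; each pair defines a new block equal to
    the referenced block followed by one extra character.
    """
    blocks = [""]
    for ref, ch in pairs:
        blocks.append(blocks[ref] + ch)
    return "".join(blocks)


def approx_search(pattern, text, k):
    """Return the 1-based end positions in `text` where `pattern` matches
    with at most k insertions, deletions and substitutions."""
    m = len(pattern)
    # col[i] = fewest edits needed to match pattern[:i] against some
    # suffix of the text read so far; col[0] stays 0 so that a match
    # may start at any text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # pattern character skipped
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


# LZ78 parse of "abracadabra": a, b, r, ac, ad, ab, ra
pairs = [(0, "a"), (0, "b"), (0, "r"), (1, "c"), (1, "d"), (1, "b"), (3, "a")]
text = lz78_decode(pairs)
exact_ends = approx_search("abra", text, 0)
one_error_ends = approx_search("abxa", text, 1)
```

Here n is the number of pairs (7) and u is the length of the decoded text (11). With k = 0 the search reduces to exact matching; with k = 1 the pattern "abxa" still matches wherever "abra" occurs, at the cost of one substitution.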
