Approximate Matching of Run-Length Compressed Strings

We study approximate matching of strings that have been compressed using run-length encoding. Previous work has concentrated on computing the longest common subsequence (LCS) of two strings of lengths m and n, compressed to m' and n' runs, respectively. We extend an existing LCS algorithm to the Levenshtein distance, achieving O(m'n + n'm) time. We further extend this algorithm to a weighted edit distance model in which the weights of the three basic edit operations (insertion, deletion, and substitution) can be chosen arbitrarily. The same approach yields an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Finally, we propose improvements to a greedy algorithm for the LCS and conjecture that the improved algorithm has O(m'n') expected-case complexity. Experimental results are provided to support the conjecture.
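To make the quantities in the abstract concrete, the following minimal Python sketch shows run-length encoding (which turns a string of m letters into m' runs) and the classical dynamic program for the weighted edit distance on the uncompressed strings. This is only the O(mn) baseline that the run-length algorithms improve on, not the O(m'n + n'm) method itself. The function names and the weight parameters `w_ins`, `w_del`, and `w_sub` are assumptions chosen for illustration.

```python
from itertools import groupby


def rle_encode(s):
    """Run-length encode s as a list of (letter, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs):
    """Expand a list of (letter, run_length) pairs back into a string."""
    return "".join(ch * length for ch, length in runs)


def weighted_edit_distance(a, b, w_ins=1, w_del=1, w_sub=1):
    """Baseline O(mn) dynamic program on the uncompressed strings.

    With all weights equal to 1 this is the Levenshtein distance.
    Only two rows of the DP matrix are kept in memory.
    """
    # Row 0: transforming the empty prefix of a into b[:j] costs j insertions.
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        # Column 0: transforming a[:i] into the empty string costs i deletions.
        cur = [i * w_del]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + w_del,                          # delete ca
                cur[j - 1] + w_ins,                       # insert cb
                prev[j - 1] + (0 if ca == cb else w_sub)  # match or substitute
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    a, b = "aaabbbbcc", "aabbbbbbc"
    print(rle_encode(a), rle_encode(b))   # m' = n' = 3 runs each
    print(weighted_edit_distance(a, b))   # unit-cost (Levenshtein) distance
```

In this example both strings have 9 letters but only 3 runs, which is the situation where bounds expressed in m' and n' pay off over the O(mn) baseline.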
