Transposition invariant string matching

Given strings A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n over an alphabet Σ ⊆ U, where U is a numerical universe closed under addition and subtraction, and a distance function d(A, B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is min_{t ∈ U} d(A + t, B), where A + t = (a_1 + t)(a_2 + t)...(a_m + t). We study the problem of computing the transposition invariant distance for various distance (and similarity) functions d, including Hamming distance, longest common subsequence (LCS), Levenshtein distance, and their versions where the exact matching condition is replaced by an approximate one. For all these problems we give algorithms whose time complexities are close to the known upper bounds for the corresponding problems without transposition invariance, and for some we achieve these upper bounds. In particular, we show how sparse dynamic programming can be used to solve transposition invariant problems, and how it connects with multidimensional range-minimum search. As a byproduct, we give improved sparse dynamic programming algorithms to compute LCS and Levenshtein distance.
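To make the definition concrete, the following is a minimal Python sketch of the naive baseline, not the sparse dynamic programming algorithms the abstract refers to. It assumes integer symbols (for example MIDI pitches) and relies on the observation that only transpositions of the form t = b_j - a_i can produce any match, so those are the only candidates worth trying. The function names `lcs_length`, `ti_hamming`, and `ti_lcs` are illustrative and do not come from the paper. Because LCS is a similarity, the sketch maximizes over t instead of minimizing.

```python
from collections import Counter


def lcs_length(a, b):
    """Classic O(mn) dynamic program for the length of the LCS of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ti_hamming(a, b):
    """Transposition invariant Hamming distance of two equal-length sequences.

    Position i matches under transposition t exactly when t = b[i] - a[i],
    so the best t is the most frequent difference.
    """
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    if not a:
        return 0
    diffs = Counter(y - x for x, y in zip(a, b))
    return len(a) - max(diffs.values())


def ti_lcs(a, b):
    """Transposition invariant LCS length by brute force over candidate t.

    Assumption: symbols are integers, so only t in {b[j] - a[i]} can create
    a match. With up to mn candidates this costs O(m^2 n^2) in the worst case.
    """
    candidates = {y - x for x in a for y in b}
    return max(
        (lcs_length([x + t for x in a], b) for t in candidates),
        default=0,
    )
```

For instance, with A = [60, 62, 64, 65] and B = [62, 64, 67, 67], three of the four positions agree under t = 2, so `ti_hamming(A, B)` is 1 and `ti_lcs(A, B)` is 3.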
