Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance

We develop efficient dynamic programming algorithms for pattern matching with general gaps and character classes. We consider patterns of the form $p_0\, g(a_0,b_0)\, p_1\, g(a_1,b_1) \ldots p_{m-1}$, where each $p_i \subset \Sigma$ is a character class over a finite alphabet $\Sigma$, and $g(a_i,b_i)$ denotes a gap between symbols $p_i$ and $p_{i+1}$ whose length lies between $a_i$ and $b_i$. A text symbol $t_j$ matches $p_i$ iff $t_j \in p_i$. Moreover, if $p_i$ matches $t_j$, then $p_{i+1}$ must match one of the text symbols $t_{j+a_i+1} \ldots t_{j+b_i+1}$. Either or both of $a_i$ and $b_i$ can be negative. We also consider transposition-invariant matching, in which the matching condition becomes $t_j \in p_i + \tau$ for some constant $\tau$ determined by the algorithms. We give algorithms with efficient average-case and worst-case running times. The algorithms have important applications in music information retrieval and computational biology. We give experimental results showing that the algorithms work well in practice.
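
The matching condition above can be illustrated with a minimal row-by-row dynamic programming sketch. This is not the paper's algorithm, only a straightforward baseline built from the definition. The function names `gapped_match_ends` and `transposition_invariant_ends` are hypothetical, as is the representation of character classes as Python sets and gaps as `(a, b)` pairs. Because each row depends only on the complete previous row, negative gap bounds need no special handling.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def gapped_match_ends(
    text: Sequence[Hashable],
    classes: Sequence[Set[Hashable]],
    gaps: Sequence[Tuple[int, int]],
) -> List[int]:
    """Return text positions j where the last class can end a full match.

    classes[i] plays the role of p_i; gaps[i] = (a_i, b_i) is the gap
    between p_i and p_{i+1}. Runs in O(m * n) time using prefix sums.
    """
    n, m = len(text), len(classes)
    if m == 0 or len(gaps) != m - 1:
        raise ValueError("need m >= 1 classes and exactly m - 1 gaps")

    # prev[j] is True iff classes[0..i] can be matched with
    # classes[i] aligned to text[j]; start with i = 0.
    prev = [text[j] in classes[0] for j in range(n)]

    for i in range(1, m):
        a, b = gaps[i - 1]
        # prefix[k] = number of True entries in prev[0:k]
        prefix = [0] * (n + 1)
        for j in range(n):
            prefix[j + 1] = prefix[j] + (1 if prev[j] else 0)

        cur = [False] * n
        for j in range(n):
            if text[j] not in classes[i]:
                continue
            # p_{i-1} at j' allows p_i at j when j' + a + 1 <= j <= j' + b + 1,
            # i.e. j' lies in [j - b - 1, j - a - 1], clipped to the text.
            lo = max(0, j - b - 1)
            hi = min(n - 1, j - a - 1)
            if lo <= hi and prefix[hi + 1] - prefix[lo] > 0:
                cur[j] = True
        prev = cur

    return [j for j in range(n) if prev[j]]


def transposition_invariant_ends(
    text: Sequence[int],
    classes: Sequence[Set[int]],
    gaps: Sequence[Tuple[int, int]],
) -> Dict[int, List[int]]:
    """Brute-force over candidate transpositions tau (integer alphabet assumed).

    Only values tau = t_j - c with c in classes[0] can start a match,
    so those are the only candidates tried.
    """
    candidates = {t - c for t in text for c in classes[0]}
    result: Dict[int, List[int]] = {}
    for tau in candidates:
        shifted = [{c + tau for c in cls} for cls in classes]
        ends = gapped_match_ends(text, shifted, gaps)
        if ends:
            result[tau] = ends
    return result
```

For instance, `gapped_match_ends("axxb", [{"a"}, {"b"}], [(1, 2)])` reports position 3, since the gap of two symbols between `a` and `b` falls within the allowed range. The brute-force transposition loop is chosen here for clarity only; it multiplies the running time by the number of candidate values of τ.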
