International Journal of Foundations of Computer Science
© World Scientific Publishing Company

EFFICIENT ALGORITHMS FOR (δ, γ, α) AND

We propose new algorithms for (δ, γ, α)-matching. In this string matching problem we are given a pattern P = p_0 p_1 … p_{m−1} and a text T = t_0 t_1 … t_{n−1} over an integer alphabet Σ = {0, …, σ − 1}. The pattern symbol p_i δ-matches the text symbol t_j iff |p_i − t_j| ≤ δ. The pattern P (δ, γ)-matches a text substring t_j … t_{j+m−1} iff |p_i − t_{j+i}| ≤ δ holds for all i and Σ_i |p_i − t_{j+i}| ≤ γ. Finally, in (δ, γ, α)-matching we also permit gaps of at most α symbols between consecutive matching text symbols. The only previously known algorithm runs in O(nm) time. We give several algorithms that improve the average-case time to as little as O(n) for small α, and the worst case to O(min{nm, |M|α}) or O(nm log(γ)/w), where M = {(i, j) | |p_i − t_j| ≤ δ} and w is the number of bits in a machine word. The proposed algorithms can easily be modified to solve several related problems; we explicitly consider character classes (instead of δ-matching), (Δ-limited) k-mismatches (instead of γ-matching), and more general gaps, including negative ones. These variants have important applications in computational biology. We conclude with experimental results showing that the algorithms are very efficient in practice.
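To make the three definitions concrete, here is a minimal Python sketch of straightforward reference checkers. It is not one of the algorithms proposed in the paper: the contiguous checker simply tests every alignment, and the gapped checker is a plain dynamic program taking roughly O(nmα) time. The function names are hypothetical, and the gap convention (consecutive matched text positions differ by at most α + 1) is an assumption about how "at most α-symbol gaps" is meant.

```python
from math import inf


def delta_match(p, t, delta):
    """Return True if pattern symbol p delta-matches text symbol t."""
    return abs(p - t) <= delta


def delta_gamma_occurrences(pattern, text, delta, gamma):
    """Start positions j where pattern (delta, gamma)-matches text[j:j+m]."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    hits = []
    for j in range(n - m + 1):
        diffs = [abs(pattern[i] - text[j + i]) for i in range(m)]
        # Every symbol must be within delta, and the total within gamma.
        if max(diffs) <= delta and sum(diffs) <= gamma:
            hits.append(j)
    return hits


def delta_gamma_alpha_end_positions(pattern, text, delta, gamma, alpha):
    """End positions of (delta, gamma, alpha)-matches (naive DP sketch).

    Assumption: consecutive matched text positions may differ by at most
    alpha + 1, i.e. at most alpha text symbols are skipped between them.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    # prev[j]: minimum total difference of matching the pattern prefix
    # processed so far with its last symbol aligned to text[j].
    prev = [abs(pattern[0] - t) if delta_match(pattern[0], t, delta) else inf
            for t in text]
    for i in range(1, m):
        cur = [inf] * n
        for j in range(i, n):
            if delta_match(pattern[i], text[j], delta):
                # Best predecessor among the alpha + 1 positions before j.
                best = min(prev[max(0, j - alpha - 1):j], default=inf)
                cur[j] = abs(pattern[i] - text[j]) + best
        prev = cur
    return [j for j, cost in enumerate(prev) if cost <= gamma]
```

For example, with `pattern = [5, 7, 9]` and `text = [5, 0, 7, 0, 0, 9, 5, 8, 9]`, the contiguous checker with δ = 1, γ = 1 reports start position 6 (the substring 5, 8, 9), while the gapped checker with δ = 0, γ = 0, α = 2 reports end position 5, since 5, 7, 9 occur at text positions 0, 2, 5 with gaps of 1 and 2 symbols.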
