On approximate pattern matching with thresholds

Abstract. In the traditional version of the approximate pattern matching problem, a pattern symbol matches a text symbol only if the two symbols are equal. Such a notion of exact equality is unsuitable when the text and pattern symbols are imprecise, e.g., obtained from an analog source or distorted by additive noise. In such situations it is more appropriate to consider two alphabet symbols to match even if they are not equal, as long as they differ by no more than a given threshold θ. The goal is then to compute the number of matches of the length-M pattern with every length-M substring of the length-N text, i.e., to compute a vector of N − M + 1 scores, where the ith score is the number of matches between the pattern and the substring that begins at text position i. The main result of this paper is that this threshold version of the problem can be solved by recursively solving 3 + 2 log θ instances of the traditional (i.e., zero-threshold) version of the problem, which is well studied in the literature and for which there are many efficient (typically randomized) solutions with time complexity close to O(N log M). This result therefore implies the first randomized O(N log M (log θ + 1)) solution for the threshold version of the problem. It also implies that any future improvement to the traditional (zero-threshold) version automatically translates into a similar improvement for the arbitrary-threshold case. Furthermore, we show that the factor Ω(log θ) is tight if we use our recursive framework.
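To make the problem statement concrete, the following is a minimal brute-force sketch of the score vector defined above. It is not the recursive algorithm of the paper; it runs in O(NM) time and only serves as a reference definition. The function name threshold_match_scores and the use of integer symbols are assumptions made for illustration.

```python
from typing import List, Sequence


def threshold_match_scores(
    text: Sequence[int], pattern: Sequence[int], theta: int
) -> List[int]:
    """Brute-force reference (hypothetical helper, not the paper's method).

    Returns N - M + 1 scores; score i counts the positions j at which
    |text[i + j] - pattern[j]| <= theta.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    scores = []
    for i in range(n - m + 1):
        # Count symbol pairs that differ by at most the threshold.
        count = sum(
            1 for j in range(m) if abs(text[i + j] - pattern[j]) <= theta
        )
        scores.append(count)
    return scores


if __name__ == "__main__":
    text = [2, 4, 3, 8, 2]
    pattern = [2, 4]
    print(threshold_match_scores(text, pattern, 0))  # zero-threshold case
    print(threshold_match_scores(text, pattern, 1))  # threshold theta = 1
```

With θ = 0 the function reduces to the traditional problem of counting exact matches, which is the special case that the recursive framework uses as its building block.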
