The k-mismatch problem revisited

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows:

• Our first result is a deterministic O(nk² log k/m + n polylog m) time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 [9, 10].
• We then give a randomised and online algorithm which runs in the same time complexity but requires only O(k² polylog m) space in total.
• Next we give a randomised (1 + ε)-approximation algorithm for the streaming k-mismatch problem which uses O(k² polylog m/ε²) space and runs in O(polylog m/ε²) worst-case time per arriving symbol.
• Finally we combine our new results to derive a randomised O(k² polylog m) space algorithm for the streaming k-mismatch problem which runs in O([EQUATION] log k + polylog m) worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 [26] by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).
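To make the problem statement concrete, the following is a minimal reference sketch of the output that a k-mismatch algorithm must produce. It is a naive O(nm)-time baseline and not any of the algorithms summarised above; the function names `k_mismatch_naive` and `k_mismatch_stream_naive` are hypothetical, and `None` stands in for the "No" answer. The streaming variant stores the last m symbols, so it uses O(m) space rather than the O(k² polylog m) space of the results above; it only illustrates the requirement that each output is produced before any future symbol is read.

```python
from collections import deque
from typing import Iterable, Iterator, List, Optional


def k_mismatch_naive(pattern: str, text: str, k: int) -> List[Optional[int]]:
    """Naive offline baseline (illustrative only, O(nm) time).

    For each alignment i, report the Hamming distance between the pattern
    and text[i:i+m] if it is at most k, otherwise None (the "No" answer).
    """
    m, n = len(pattern), len(text)
    out: List[Optional[int]] = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if pattern[j] != text[i + j]:
                mismatches += 1
                if mismatches > k:
                    break  # distance already exceeds k, so stop counting
        out.append(mismatches if mismatches <= k else None)
    return out


def k_mismatch_stream_naive(
    pattern: str, symbols: Iterable[str], k: int
) -> Iterator[Optional[int]]:
    """Naive streaming baseline (illustrative only, O(m) space).

    Consumes the text one symbol at a time and yields one answer per
    arriving symbol once a full window of m symbols has been seen.
    """
    m = len(pattern)
    window: deque = deque(maxlen=m)  # holds the most recent m symbols
    for s in symbols:
        window.append(s)
        if len(window) < m:
            continue  # no full alignment ends at this symbol yet
        d = sum(1 for a, b in zip(pattern, window) if a != b)
        yield d if d <= k else None


if __name__ == "__main__":
    print(k_mismatch_naive("abc", "abcabdxbc", 1))
    print(list(k_mismatch_stream_naive("abc", "abcabdxbc", 1)))
```

Both functions return the same sequence of answers on the same input; the streaming version simply emits each answer as soon as the corresponding window is complete.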

[1]  Noga Alon, et al.  The Space Complexity of Approximating the Frequency Moments, 1999.

[2]  Steven Skiena, et al.  Pattern matching with address errors: rearrangement distances, 2006, SODA '06.

[3]  Wei Huang, et al.  The communication complexity of the Hamming distance problem, 2006.

[4]  Raphaël Clifford, et al.  Pseudo-realtime Pattern Matching: Closing the Gap, 2010, CPM.

[5]  J. Rosser, et al.  Approximate formulas for some functions of prime numbers, 1962.

[6]  Ely Porat, et al.  Swap and mismatch edit distance, 2004, Algorithmica.

[7]  Shengyu Zhang, et al.  The communication complexity of the Hamming distance problem, 2006, Inf. Process. Lett.

[8]  Ely Porat, et al.  Dictionary Matching in a Stream, 2015, ESA.

[9]  Richard M. Karp, et al.  Efficient Randomized Pattern-Matching Algorithms, 1987, IBM J. Res. Dev.

[10]  Moshe Lewenstein, et al.  Function Matching, 2006, SIAM J. Comput.

[11]  Ely Porat, et al.  A black box for online approximate pattern matching, 2008, Inf. Comput.

[12]  Kuan-Yu Chen, et al.  Hardness of comparing two run-length encoded strings, 2010, J. Complex.

[13]  Alfred V. Aho, et al.  Efficient string matching, 1975.

[14]  Moshe Lewenstein, et al.  Faster algorithms for string matching with k mismatches, 2000, SODA '00.

[15]  Zvi Galil, et al.  Real-Time Streaming String-Matching, 2011, CPM.

[16]  Uzi Vishkin, et al.  Fast String Matching with k Differences, 1988, J. Comput. Syst. Sci.

[17]  Noga Alon, et al.  The space complexity of approximating the frequency moments, 1996, STOC '96.

[18]  Yonatan Aumann, et al.  Approximate string matching with address bit errors, 2008, Theor. Comput. Sci.

[19]  Markus Jalsenius, et al.  Parameterized Matching in the Streaming Model, 2013, STACS.

[20]  Ely Porat, et al.  Exact and Approximate Pattern Matching in the Streaming Model, 2009, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[21]  Howard J. Karloff.  Fast Algorithms for Approximately Counting Mismatches, 1993, Inf. Process. Lett.

[22]  Moshe Lewenstein, et al.  Overlap matching, 2001, SODA '01.

[23]  Gad M. Landau, et al.  Efficient String Matching with k Mismatches, 2018, Theor. Comput. Sci.

[24]  Funda Ergün, et al.  Periodicity in Streams, 2010, APPROX-RANDOM.

[25]  Karl R. Abrahamson.  Generalized String Matching, 1987, SIAM J. Comput.

[26]  Piotr Indyk, et al.  Faster algorithms for string matching problems: matching the convolution bound, 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280).

[27]  Ely Porat, et al.  Space lower bounds for online pattern matching, 2013, Theor. Comput. Sci.

[28]  S. Muthukrishnan, et al.  Alphabet Dependence in Parameterized Matching, 1994, Inf. Process. Lett.

[29]  Ely Porat, et al.  A Black Box for Online Approximate Pattern Matching, 2008, CPM.