Longest Common Substring with Approximately k Mismatches

In the longest common substring problem, we are given two strings of length n and must find a substring of maximal length that occurs in both strings. It is well known that the problem can be solved in linear time, but the solution is not robust and can change greatly when even one character of the input strings is changed. To circumvent this, Leimeister and Morgenstern introduced the problem of the longest common substring with k mismatches. Lately, this problem has received a lot of attention in the literature. In this paper, we first show a conditional lower bound based on the Strong Exponential Time Hypothesis (SETH), implying that there is little hope of improving existing solutions. We then introduce a new but closely related problem, the longest common substring with approximately k mismatches, and use locality-sensitive hashing to show that it admits a solution with strongly subquadratic running time. We also apply these results to obtain a strongly subquadratic-time 2-approximation algorithm for the longest common substring with k mismatches problem, and we show conditional hardness of improving its approximation ratio.
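
To make the problem statement concrete, here is a minimal sketch of a simple quadratic-time baseline for the longest common substring with k mismatches. It is not the algorithm of this paper: it only illustrates the definition by sliding a window along every diagonal of the two strings and keeping at most k mismatching positions inside the window. The function name `lcs_k_mismatches` and its return format are assumptions made for this illustration.

```python
def lcs_k_mismatches(s, t, k):
    """Return (length, i, j) such that s[i:i+length] and t[j:j+length]
    differ in at most k positions and length is as large as possible.

    Illustrative O(len(s) * len(t)) baseline, not a subquadratic method.
    """
    n, m = len(s), len(t)
    best = (0, 0, 0)
    # Each diagonal d pairs position i of s with position j = i - d of t.
    for d in range(-(m - 1), n):
        i0 = max(d, 0)              # first position of s on this diagonal
        j0 = i0 - d                 # matching first position of t
        length = min(n - i0, m - j0)
        left = 0                    # left end of the current window
        mismatches = 0              # mismatches inside the window
        for right in range(length):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window from the left until it has <= k mismatches.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            window = right - left + 1
            if window > best[0]:
                best = (window, i0 + left, j0 + left)
    return best


# Changing one character of "abcde" shortens the exact answer,
# while allowing one mismatch recovers the full length.
exact = lcs_k_mismatches("abcde", "abxde", 0)
relaxed = lcs_k_mismatches("abcde", "abxde", 1)
```

With k = 0 the sketch reduces to the ordinary longest common substring, which shows the lack of robustness mentioned above: a single changed character shortens the answer for these two inputs, while k = 1 tolerates it.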
