Approximating longest common substring with $k$ mismatches: Theory and practice

In the problem of the longest common substring with $k$ mismatches, we are given two strings $X$ and $Y$ and must find the maximal length $\ell$ such that there is a length-$\ell$ substring of $X$ and a length-$\ell$ substring of $Y$ that differ in at most $k$ positions. The length $\ell$ can be used as a robust measure of similarity between $X$ and $Y$. In this work, we develop new approximation algorithms for computing $\ell$ that are significantly more efficient than previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound.
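To make the problem definition concrete, here is a minimal Python sketch of an exact quadratic-time baseline. It is not one of the approximation algorithms developed in this work; it only illustrates what $\ell$ measures. The function name `lcs_k_mismatches` and the sliding-window strategy are assumptions chosen for illustration: for every alignment (diagonal) of $X$ against $Y$, it keeps the longest window containing at most $k$ mismatching positions.

```python
def lcs_k_mismatches(x: str, y: str, k: int) -> int:
    """Exact baseline (illustrative only): longest common substring of x and y
    with at most k mismatches, in O(len(x) * len(y)) time."""
    n, m = len(x), len(y)
    best = 0
    # Each diagonal is fixed by the offset d = (position in x) - (position in y).
    for d in range(-(m - 1), n):
        i = max(d, 0)            # start of the diagonal in x
        j = i - d                # start of the diagonal in y
        length = min(n - i, m - j)
        left = 0                 # left end of the current window on the diagonal
        mismatches = 0           # mismatches inside the current window
        for right in range(length):
            if x[i + right] != y[j + right]:
                mismatches += 1
            # Shrink the window from the left until it has at most k mismatches.
            while mismatches > k:
                if x[i + left] != y[j + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


if __name__ == "__main__":
    print(lcs_k_mismatches("abcde", "xbcdy", 0))
    print(lcs_k_mismatches("abcde", "xbcdy", 1))
```

For long inputs this quadratic scan becomes expensive, which is the kind of cost that faster approximation algorithms aim to avoid.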
