Faster Algorithms for Longest Common Substring

In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $\sigma$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $O(n \log \sigma)$-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an $O(n)$-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in $O(n \log \sigma / \log n)$ space and read in $O(n \log \sigma / \log n)$ time. We show that, in this model, we can compute an LCS in time $O(n \log \sigma / \sqrt{\log n})$, which is sublinear in $n$ if $\sigma = 2^{o(\sqrt{\log n})}$ (in particular, if $\sigma = O(1)$), using optimal space $O(n \log \sigma / \log n)$.

We then lift our ideas to the problem of computing a $k$-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of $S$ that occurs in $T$ with at most $k$ mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in $O(n \log n)$ time [IPL 2015]. Thankachan et al. extended this result to computing a $k$-mismatch LCS in $O(n \log^{k} n)$ time for $k = O(1)$ [J. Comput. Biol. 2016]. We show an $O(n \log^{k-1/2} n)$-time algorithm, for any constant integer $k > 0$ and irrespective of the alphabet size, using $O(n)$ space, as do the previous approaches. We thus notably break through the well-known $n \log^{k} n$ barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with $k$ errors.

2012 ACM Subject Classification: Theory of computation → Pattern matching
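To make the two problem statements above concrete, here is a minimal Python sketch of quadratic-time baselines for LCS and for k-mismatch LCS. It is not the algorithm of the paper and does not use suffix trees, word-RAM packing, or heavy-path decomposition; the function names `longest_common_substring` and `k_mismatch_lcs` are hypothetical and chosen only for illustration.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Return a longest string occurring in both s and t.

    Illustrative O(|s|*|t|) dynamic programme, not the paper's method.
    cur[j] is the length of the longest common suffix of s[:i] and t[:j].
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]


def k_mismatch_lcs(s: str, t: str, k: int) -> tuple[int, int, int]:
    """Return (length, start_in_s, start_in_t) of a longest pair of
    equal-length substrings of s and t at Hamming distance at most k.

    Illustrative O(|s|*|t|) baseline: for every alignment (diagonal) of
    s against t, slide a window that never holds more than k mismatches.
    """
    best = (0, 0, 0)
    for shift in range(-(len(t) - 1), len(s)):
        # On this diagonal, s[i0 + d] is aligned with t[j0 + d].
        i0 = max(0, shift)
        j0 = i0 - shift
        length = min(len(s) - i0, len(t) - j0)
        left = 0
        mismatches = 0
        for right in range(length):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            # Shrink the window from the left until it is valid again.
            while mismatches > k:
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, i0 + left, j0 + left)
    return best
```

With `k = 0` the second function reduces to the exact LCS problem, so its returned length should agree with `len(longest_common_substring(s, t))`. Both baselines take time proportional to the product of the input lengths, which is the kind of cost the bounds quoted in the abstract improve upon.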
