Streaming and Small Space Approximation Algorithms for Edit Distance and Longest Common Subsequence

The edit distance (ED) and longest common subsequence (LCS) are two fundamental problems that quantify how similar two strings are to one another. In this paper, we first consider these problems in the asymmetric streaming model introduced by Andoni, Krauthgamer and Onak [11] (FOCS’10) and Saks and Seshadhri [64] (SODA’13). In this model we have random access to one string and streaming access to the other. Our main contribution is a constant-factor approximation algorithm for ED with memory Õ(n^δ) for any constant δ > 0. In addition, we present an upper bound of Õ_ε(√n) on the memory needed to approximate ED or LCS within a factor of 1 ± ε. All our algorithms are deterministic and run in polynomial time in a single pass. We further study small-space approximation algorithms for ED, LCS, and longest increasing subsequence (LIS) in the non-streaming setting. Here, we design algorithms that achieve a 1 ± ε approximation for all three problems, where ε > 0 can be any constant and even slightly sub-constant. Our algorithms use only poly-logarithmic space while maintaining a polynomial running time. This significantly improves on previous results in terms of space complexity, all of which use space at least Ω(√n). Our algorithms make novel use of the triangle inequality and carefully designed recursions to save space, which can be of independent interest.

2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
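For readers who want the three quantities pinned down concretely, the following is a minimal sketch of the standard textbook exact algorithms for ED, LCS, and LIS. It is not the paper's method: these baselines use space linear in the input length (two dynamic-programming rows, or a patience-sorting array), which is the space cost that the approximation algorithms described above aim to beat. The function names are chosen here for illustration only.

```python
from bisect import bisect_left


def edit_distance(x: str, y: str) -> int:
    """Exact edit distance (insert, delete, substitute) with two DP rows."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter string along the row to save space
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # deleting the first i characters of x
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def lcs_length(x: str, y: str) -> int:
    """Exact length of a longest common subsequence with two DP rows."""
    if len(y) > len(x):
        x, y = y, x
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lis_length(seq) -> int:
    """Exact length of a longest strictly increasing subsequence."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3 and `lcs_length("ABCBDAB", "BDCABA")` to 4. The two string routines run in quadratic time and store O(min(|x|, |y|)) integers, while `lis_length` runs in O(n log n) time and stores up to n values.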

[1]  Yota Otachi,et al.  Longest Common Subsequence in Sublinear Space , 2020, Inf. Process. Lett..

[2]  Saeed Seddighin,et al.  Improved MPC Algorithms for Edit Distance and Ulam Distance , 2019, IEEE Transactions on Parallel and Distributed Systems.

[3]  Jelani Nelson,et al.  An Improved Sketching Algorithm for Edit Distance , 2021, STACS.

[4]  Negev Shekel Nosatzki,et al.  Edit Distance in Near-Linear Time: it's a Constant Factor , 2020, 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS).

[5]  Alireza Farhadi,et al.  Streaming with Oracle: New Streaming Algorithms for Edit Distance and LCS , 2020, ArXiv.

[6]  Zhengzhong Jin,et al.  Space Efficient Deterministic Approximation of String Measures , 2020, ArXiv.

[7]  Michael E. Saks,et al.  Constant factor approximations to edit distance on far input pairs in nearly linear time , 2019, STOC.

[8]  Zhao Song,et al.  Reducing approximate Longest Common Subsequence to approximate Edit Distance , 2019, SODA.

[9]  Aviad Rubinstein,et al.  Constant-factor approximation of near-linear edit distance in near-linear time , 2019, STOC.

[10]  Xiaorui Sun,et al.  Approximation Algorithms for LCS and LIS with Truly Improved Running Times , 2019, 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS).

[11]  Robert Krauthgamer,et al.  Sublinear Algorithms for Gap Edit Distance , 2019, 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS).

[12]  William Kuszmaul,et al.  Dynamic Time Warping in Strongly Subquadratic Time: Algorithms for the Low-Distance Regime and Approximate Evaluation , 2019, ICALP.

[13]  Bernhard Haeupler,et al.  Near-linear time insertion-deletion codes and (1+ε)-approximating edit distance via indexing , 2018, STOC.

[14]  Guy N. Rothblum,et al.  Fine-grained Complexity Meets IP = PSPACE , 2018, SODA.

[15]  Yota Otachi,et al.  Space-Efficient Algorithms for Longest Increasing Subsequence , 2017, Theory of Computing Systems.

[16]  Mohammad Taghi Hajiaghayi,et al.  Massively Parallel Approximation Algorithms for Edit Distance and Longest Common Subsequence , 2019, SODA.

[17]  Michael E. Saks,et al.  Approximating Edit Distance within Constant Factor in Truly Sub-Quadratic Time , 2018, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS).

[18]  Moses Charikar,et al.  On Estimating Edit Distance: Alignment, Dimension Reduction, and Embeddings , 2018, ICALP.

[19]  Marvin Künnemann,et al.  Multivariate Fine-Grained Complexity of Longest Common Subsequence , 2018, SODA.

[20]  Mohammad Ghodsi,et al.  Approximating Edit Distance in Truly Subquadratic Time: Quantum and MapReduce , 2018, SODA.

[21]  Amir Abboud,et al.  Fast and Deterministic Constant Factor Approximation Algorithms for LCS Imply New Circuit Lower Bounds , 2018, ITCS.

[22]  Barna Saha,et al.  Fast & Space-Efficient Approximations of Language Edit Distance and RNA Folding: An Amnesic Dynamic Programming Approach , 2017, 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

[23]  Micha Sharir,et al.  Dynamic Time Warping and Geometric Edit Distance: Breaking the Quadratic Barrier , 2016, ICALP.

[24]  Amir Abboud,et al.  Towards Hardness of Approximation for Polynomial Time Problems , 2017, ITCS.

[25]  Barna Saha,et al.  Approximating Language Edit Distance Beyond Fast Matrix Multiplication: Ultralinear Grammars Are Where Parsing Becomes Hard! , 2017, ICALP.

[26]  Fabrizio Grandoni,et al.  Truly Sub-cubic Algorithms for Language Edit Distance and RNA-Folding via Fast Bounded-Difference Min-Plus Product , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[27]  Qin Zhang,et al.  Edit Distance: Sketching, Streaming, and Document Exchange , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[28]  Michal Koucký,et al.  Streaming Algorithms For Computing Edit Distance Without Exploiting Suffix Trees , 2016, ArXiv.

[29]  Michal Koucký,et al.  Streaming algorithms for embedding and computing edit distance in the low distance regime , 2016, STOC.

[30]  Ryan Williams,et al.  Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made , 2015, STOC.

[31]  Amir Abboud,et al.  Tight Hardness Results for LCS and Other Sequence Similarity Measures , 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[32]  Marvin Künnemann,et al.  Quadratic Conditional Lower Bounds for String Problems and Dynamic Time Warping , 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[33]  Michael E. Saks,et al.  A polylogarithmic space deterministic streaming algorithm for approximating distance to monotonicity , 2015, SODA.

[34]  Piotr Indyk,et al.  Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false) , 2014, STOC.

[35]  Barna Saha,et al.  Language Edit Distance and Maximum Likelihood Parsing of Stochastic Grammars: Faster Algorithms and Connection to Fundamental Graph Problems , 2014, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[36]  Alexandr Andoni,et al.  Homomorphic fingerprints under misalignments: sketching edit and shift distances , 2013, STOC '13.

[37]  Michael E. Saks,et al.  Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance , 2012, SODA.

[38]  Alexandr Andoni,et al.  The smoothed complexity of edit distance , 2008, TALG.

[39]  Alexandr Andoni,et al.  Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[40]  Alexandr Andoni,et al.  Approximating edit distance in near-linear time , 2009, STOC '09.

[41]  Funda Ergün,et al.  On distance to monotonicity and longest increasing subsequence of a data stream , 2008, SODA '08.

[42]  Alexander Tiskin,et al.  Semi-local String Comparison: Algorithmic Techniques and Applications , 2007, Math. Comput. Sci..

[43]  Moshe Lewenstein,et al.  On the Longest Common Rigid Subsequence Problem , 2005, Algorithmica.

[44]  Alexandr Andoni,et al.  The Computational Hardness of Estimating Edit Distance [Extended Abstract] , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[45]  Anna Gál,et al.  Lower Bounds on Streaming Algorithms for Approximating the Length of the Longest Increasing Subsequence , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[46]  Robert Krauthgamer,et al.  Estimating the sortedness of a data stream , 2007, SODA '07.

[47]  David P. Woodruff,et al.  The communication and streaming complexity of computing the longest common and increasing subsequences , 2007, SODA '07.

[48]  Edson Cáceres,et al.  A Coarse-Grained Parallel Algorithm for the All-Substrings Longest Common Subsequence Problem , 2006, Algorithmica.

[49]  Funda Ergün,et al.  Oblivious string embeddings and edit distance approximations , 2006, SODA '06.

[50]  Erik Vee,et al.  Finding longest increasing and common subsequences in streaming data , 2005, J. Comb. Optim..

[51]  Rafail Ostrovsky,et al.  Low distortion embeddings for edit distance , 2005, STOC '05.

[52]  Robert Krauthgamer,et al.  Approximating edit distance efficiently , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[53]  Amit Kumar,et al.  Correlating XML data streams using tree-edit distance embeddings , 2003, PODS '03.

[54]  Gad M. Landau,et al.  A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices , 2003, SIAM J. Comput..

[55]  Alexandr Andoni,et al.  Lower bounds for embedding edit distance into normed spaces , 2003, SODA '03.

[56]  Maxime Crochemore,et al.  A fast and practical bit-vector algorithm for the Longest Common Subsequence problem , 2001, Inf. Process. Lett..

[57]  Piotr Indyk,et al.  Algorithmic applications of low-distortion geometric embeddings , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[58]  P. Diaconis,et al.  Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem , 1999 .

[59]  J. Boutet de Monvel Extensive simulations for longest common subsequences , 1999 .

[60]  Gad M. Landau,et al.  Incremental String Comparison , 1998, SIAM J. Comput..

[61]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[62]  Mike Paterson,et al.  A Faster Algorithm Computing String Edit Distances , 1980, J. Comput. Syst. Sci..

[63]  Thomas G. Szymanski,et al.  A fast algorithm for computing longest common subsequences , 1977, CACM.

[64]  Walter J. Savitch,et al.  Relationships Between Nondeterministic and Deterministic Tape Complexities , 1970, J. Comput. Syst. Sci..