Streaming algorithms for embedding and computing edit distance in the low distance regime

The Hamming and the edit metrics are two common notions of distance between pairs of strings x,y lying in the Boolean hypercube. The edit distance between x and y is defined as the minimum number of character insertions, deletions, and bit flips needed to convert x into y. The Hamming distance between x and y, by contrast, is the number of bit flips needed to convert x into y. In this paper we study a randomized injective embedding of the edit distance into the Hamming distance with small distortion. We show a randomized embedding with quadratic distortion: for any x,y whose edit distance equals k, the Hamming distance between the embeddings of x and y is O(k^2) with high probability. For small values of k, this improves over the distortion ratio of O(log n log* n) obtained by Jowhari (2012). Moreover, the embedding output size is linear in the input size, and the embedding can be computed using a single pass over the input. We provide several applications of this embedding. Among our results, we provide a one-pass (streaming) algorithm for edit distance that runs in space O(s) and computes the edit distance exactly up to distance s^(1/6). This algorithm is based on a kernelization for edit distance that is of independent interest.
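To make the two metrics concrete, the following is a minimal Python sketch of the definitions above. It is not the paper's embedding or streaming algorithm; it only computes the two distances by textbook methods. The function names `hamming_distance` and `edit_distance` are illustrative, and the edit distance uses the standard quadratic-time dynamic program, treating a substitution on a binary string as a bit flip.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    (bit flips on binary strings) that turn x into y.

    Textbook dynamic program, keeping only two rows of the table.
    """
    # prev[j] = edit distance between the empty prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]  # distance between x[:i] and the empty string
        for j, b in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from x
                cur[j - 1] + 1,          # insert b into x
                prev[j - 1] + (a != b),  # match, or flip a into b
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, y = "01010101", "10101010"
    print(hamming_distance(x, y))  # every position differs
    print(edit_distance(x, y))     # delete the leading 0, append a 0
```

On this pair the Hamming distance is 8 while the edit distance is only 2, which illustrates why mapping strings so that small edit distance implies small Hamming distance needs a nontrivial embedding rather than the identity map.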

[1]  Robert Krauthgamer, et al. Approximating edit distance efficiently, 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[2]  Barna Saha, et al. The Dyck Language Edit Distance Problem in Near-Linear Time, 2014, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[3]  Rafail Ostrovsky, et al. Efficient search for approximate nearest neighbor in high dimensional spaces, 1998, STOC '98.

[4]  Donald E. Knuth, et al. Fast Pattern Matching in Strings, 1977, SIAM J. Comput.

[5]  Wojciech Rytter, et al. Text Algorithms, 1994.

[6]  Alon Orlitsky, et al. Interactive communication: balanced distributions, correlated files, and average-case complexity, 1991, Proceedings 32nd Annual Symposium of Foundations of Computer Science.

[7]  Gad M. Landau, et al. Incremental String Comparison, 1998, SIAM J. Comput.

[8]  A. Razborov. Communication Complexity, 2011.

[9]  Oded Goldreich, et al. The Foundations of Cryptography - Volume 1: Basic Techniques, 2001.

[10]  Alexandr Andoni, et al. Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity, 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[11]  Rafail Ostrovsky, et al. Low distortion embeddings for edit distance, 2005, STOC '05.

[12]  Piotr Indyk, et al. Approximate nearest neighbors: towards removing the curse of dimensionality, 1998, STOC '98.

[13]  Martin Dietzfelbinger, et al. Universal Hashing and k-Wise Independent Random Variables via Integer Arithmetic without Primes, 1996, STACS.

[14]  Michael J. Fischer, et al. The String-to-String Correction Problem, 1974, JACM.

[15]  Ronitt Rubinfeld, et al. A sublinear algorithm for weakly approximating edit distance, 2003, STOC '03.

[16]  Graham Cormode, et al. The string edit distance matching problem with moves, 2002, SODA '02.

[17]  Oded Goldreich. Foundations of Cryptography: Index, 2001.

[18]  Robert J. Vanderbei, et al. The Kruskal Count, 2009, The Mathematics of Preference, Choice and Order.

[19]  Oded Goldreich. Foundations of Cryptography: Volume 1, 2006.

[20]  Oded Goldreich, et al. Foundations of Cryptography: List of Figures, 2001.

[21]  Yuval Rabani, et al. Improved lower bounds for embeddings into L1, 2006, SODA '06.

[22]  Mike Paterson, et al. A Faster Algorithm Computing String Edit Distances, 1980, J. Comput. Syst. Sci.

[23]  Uzi Vishkin, et al. Communication complexity of document exchange, 1999, SODA '00.

[24]  Hossein Jowhari, et al. Efficient Communication Protocols for Deciding Edit Distance, 2012, ESA.

[25]  Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[26]  Esko Ukkonen, et al. Algorithms for Approximate String Matching, 1985, Inf. Control.

[27]  Alexandr Andoni, et al. Lower bounds for embedding edit distance into normed spaces, 2003, SODA '03.

[28]  Leah Epstein, et al. Algorithms - ESA 2012: 20th Annual European Symposium, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, 2012.

[29]  Hideya Iwasaki. Famous Books and Papers of the 20th Century: D. E. Knuth, J. H. Morris, V. R. Pratt: Fast pattern matching in strings, 2004.

[30]  Funda Ergün, et al. Oblivious string embeddings and edit distance approximations, 2006, SODA '06.

[31]  Ely Porat, et al. Improved Sketching of Hamming Distance with Error Correcting, 2007, CPM.

[32]  S. Rajsbaum. Foundations of Cryptography, 2014.

[33]  Piotr Indyk, et al. Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false), 2014, STOC.

[34]  Y. Rabani, et al. Improved lower bounds for embeddings into L1, 2006, SODA 2006.

[35]  Noam Nisan, et al. Pseudorandom generators for space-bounded computations, 1990, STOC '90.

[36]  Alexandr Andoni, et al. Approximating edit distance in near-linear time, 2009, STOC '09.

[37]  Subhash Khot, et al. Nonembeddability theorems via Fourier analysis, 2005, 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05).

[38]  V. Climenhaga. Markov chains and mixing times, 2013.

[39]  Gonzalo Navarro, et al. A guided tour to approximate string matching, 2001, CSUR.