Edit Distance: Sketching, Streaming, and Document Exchange

We show that in the document exchange problem, where Alice holds x ∈ {0, 1}^n and Bob holds y ∈ {0, 1}^n, Alice can send Bob a message of O(K(log^2 K + log n)) bits such that Bob can recover x from the message and his input y if the edit distance between x and y is at most K, and outputs "error" otherwise. Both the encoding and the decoding can be done in time Õ(n + poly(K)). This result significantly improves on the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold x and y respectively, they can compute sketches of x and y of size poly(K log n) bits (the encoding) and send them to the referee, who can then compute the edit distance between x and y together with all the edit operations if the edit distance is at most K, and outputs "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using poly(K log n) bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using poly(K log n) bits of space.
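The output contract described above (report the edit distance when it is at most K, report "error" otherwise) can be illustrated with a minimal sketch. This is not the sketching or document exchange algorithm of the paper: it assumes a single party sees both strings in full and runs a standard banded dynamic program. The function name `bounded_edit_distance` and the use of `None` to stand for "error" are assumptions made for illustration only.

```python
from typing import Optional


def bounded_edit_distance(x: str, y: str, k: int) -> Optional[int]:
    """Return the edit distance between x and y if it is at most k.

    Returns None (standing in for "error") when the distance exceeds k.
    Hypothetical helper for illustration; not the paper's algorithm.
    """
    n, m = len(x), len(y)
    # A length gap larger than k already forces more than k edits.
    if abs(n - m) > k:
        return None

    inf = k + 1  # every value above k is capped to this sentinel
    # prev[j] = distance between the current prefix of x and y[:j], capped.
    prev = [j if j <= k else inf for j in range(m + 1)]

    for i in range(1, n + 1):
        # Full rows are allocated for readability; only cells with
        # |i - j| <= k are filled, the rest stay at the sentinel.
        cur = [inf] * (m + 1)
        lo = max(0, i - k)
        hi = min(m, i + k)
        if lo == 0:
            cur[0] = i  # delete the first i characters of x (i <= k here)
        for j in range(max(1, lo), hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            best = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # deletion from x
                cur[j - 1] + 1,      # insertion into x
            )
            cur[j] = min(best, inf)
        prev = cur

    return prev[m] if prev[m] <= k else None
```

For example, `bounded_edit_distance("0110100", "010100", 1)` returns `1`, while `bounded_edit_distance("kitten", "sitting", 2)` returns `None` because three edits are needed.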
