Embedding the Ulam metric into ℓ1

Edit distance is a fundamental measure of distance between strings, and its extensive study has recently focused on computational problems such as nearest neighbor search, sketching, and fast approximation. A very powerful paradigm is to map the metric space induced by the edit distance into a normed space (e.g., ℓ1) with small distortion, and then use the rich algorithmic toolkit known for normed spaces. Although the minimum distortion required to embed edit distance into ℓ1 has received a lot of attention lately, there is a large gap between the known upper and lower bounds. We make progress on this question by considering large, well-structured submetrics of the edit distance metric space. Our main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most n, embeds into ℓ1 with distortion O(log n). This immediately leads to sketching algorithms with constant-size sketches, and to efficient approximate nearest neighbor search algorithms with approximation factor O(log n). The embedding and its algorithmic consequences are a substantial improvement over those previously known for the Ulam metric, and they are significantly better than the state of the art for edit distance in general. Further, we extend these results for the Ulam metric to edit distance on strings that are (locally) non-repetitive, i.e., strings where (close-by) substrings are distinct.
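
To make the definition of the Ulam metric concrete, the following is a minimal sketch in Python that computes the edit distance between two sequences of distinct symbols with the textbook dynamic program. It does not implement the embedding from the paper. The function names (`edit_distance`, `ulam_distance`, `lcs_length_distinct`) are illustrative assumptions, and the edit operations are assumed to be insertion, deletion, and substitution. The last helper uses the standard reduction from longest common subsequence to longest increasing subsequence, which applies when symbols do not repeat.

```python
from bisect import bisect_left


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute) via a two-row DP."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def ulam_distance(p, q):
    """Edit distance restricted to sequences with no repeated symbol."""
    if len(set(p)) != len(p) or len(set(q)) != len(q):
        raise ValueError("inputs must not contain repeated symbols")
    return edit_distance(p, q)


def lcs_length_distinct(p, q):
    """LCS length for sequences of distinct symbols, computed as an LIS."""
    pos = {s: i for i, s in enumerate(q)}
    tails = []  # tails[k] = smallest last index of an increasing run of length k + 1
    for s in p:
        if s not in pos:
            continue
        k = bisect_left(tails, pos[s])
        if k == len(tails):
            tails.append(pos[s])
        else:
            tails[k] = pos[s]
    return len(tails)


p, q = [1, 2, 3, 4, 5], [5, 1, 2, 3, 4]
print(ulam_distance(p, q))        # 2: delete the leading 5, insert it at the end
print(lcs_length_distinct(p, q))  # 4: the common subsequence 1, 2, 3, 4
```

For two sequences of equal length n, the edit distance under these operations lies between n - LCS and 2(n - LCS), so the near-linear-time LIS computation already gives a 2-approximation of the quadratic-time dynamic program on such inputs.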
