Similarity estimation techniques from rounding algorithms

A locality sensitive hashing scheme is a distribution on a family $\mathcal{F}$ of hash functions operating on a collection of objects, such that for two objects $x, y$, $\Pr_{h \in \mathcal{F}}[h(x) = h(y)] = \mathrm{sim}(x, y)$, where $\mathrm{sim}(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure $\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$.

We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:
<ol>
<li>A collection of vectors with the distance between $\vec{u}$ and $\vec{v}$ measured by $\theta(\vec{u}, \vec{v})/\pi$, where $\theta(\vec{u}, \vec{v})$ is the angle between $\vec{u}$ and $\vec{v}$. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to min-wise independent permutations for estimating set similarity.</li>
<li>A collection of distributions on $n$ points in a metric space, with distance between distributions measured by the Earth Mover Distance (<b>EMD</b>), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions $P$ and $Q$, $\mathrm{EMD}(P, Q) \le \mathbf{E}_{h \in \mathcal{F}}[d(h(P), h(Q))] \le O(\log n \log \log n) \cdot \mathrm{EMD}(P, Q)$.</li>
</ol>
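To make the min-wise construction concrete, the following is a minimal Python sketch, not code from the paper. It assumes a small integer universe and stores truly random permutations explicitly, which are min-wise independent but impractical at scale. The helper names `make_permutations`, `minhash_sketch`, and `estimate_similarity` are hypothetical. Each sketch coordinate is the minimum rank of a set under one permutation, and the fraction of matching coordinates estimates $|A \cap B| / |A \cup B|$.

```python
import random

def make_permutations(universe_size, k, seed=0):
    """Draw k random permutations of {0, ..., universe_size - 1}.

    perm[x] is the rank of element x. Explicit permutations are an
    illustrative assumption; they are only feasible for small universes.
    """
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def minhash_sketch(items, perms):
    """One hash value per permutation: the minimum rank over the (non-empty) set."""
    return [min(perm[x] for x in items) for perm in perms]

def estimate_similarity(sketch_a, sketch_b):
    """Fraction of coordinates where the two sketches collide."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)

A = set(range(0, 60))
B = set(range(30, 90))
perms = make_permutations(universe_size=100, k=2000)
est = estimate_similarity(minhash_sketch(A, perms), minhash_sketch(B, perms))
exact = len(A & B) / len(A | B)  # 30 / 90
print(f"estimated {est:.3f}, exact {exact:.3f}")
```

Under a single random permutation the two minima coincide exactly when the overall minimum of $A \cup B$ lies in $A \cap B$, which is why each coordinate collides with probability $\mathrm{sim}(A, B)$. Averaging over more permutations tightens the estimate.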
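The vector scheme in item 1 can be illustrated with a random-hyperplane sketch in the style of SDP rounding. The abstract states only the resulting distance $\theta(\vec{u}, \vec{v})/\pi$, so the construction below is an assumption: each hash bit is the sign of the inner product with a random Gaussian vector, and two vectors then disagree on a bit with probability $\theta/\pi$. The names `make_hyperplanes`, `sketch`, and `estimate_angle` are hypothetical, and this is a sketch rather than a definitive implementation.

```python
import math
import random

def make_hyperplanes(dim, k, seed=0):
    """Draw k random directions with i.i.d. standard Gaussian coordinates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def sketch(vec, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane vec falls on."""
    return [
        1 if sum(r_i * v_i for r_i, v_i in zip(r, vec)) >= 0 else 0
        for r in hyperplanes
    ]

def estimate_angle(sketch_u, sketch_v):
    """The fraction of disagreeing bits estimates theta / pi."""
    disagree = sum(1 for a, b in zip(sketch_u, sketch_v) if a != b)
    return math.pi * disagree / len(sketch_u)

u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]  # the angle between u and v is pi / 4
planes = make_hyperplanes(dim=3, k=4000)
theta_hat = estimate_angle(sketch(u, planes), sketch(v, planes))
cos_hat = math.cos(theta_hat)  # estimate of the cosine similarity
print(f"angle estimate {theta_hat:.3f}, cosine estimate {cos_hat:.3f}")
```

Each vector is reduced to `k` bits regardless of its dimension. The cosine estimate comes from applying $\cos$ to the estimated angle, so its error grows with the error in the disagreement fraction.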
