Locality-sensitive hashing scheme based on p-stable distributions

We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions.Our scheme improves the running time of the earlier algorithm for the case of the lp norm. It also yields the first known provably efficient approximate NN algorithm for the case p<1. We also show that the algorithm finds the exact near neigbhor in O(log n) time for data satisfying certain "bounded growth" condition.Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that the our data structure is up to 40 times faster than kd-tree.

[1]  C. Mallows,et al.  A Method for Simulating Stable Random Variables , 1976 .

[2]  Giorgio Ausiello,et al.  Proceedings of the International Conference on Database Theory , 1986, ICDT 1986.

[3]  V. Zolotarev One-dimensional stable distributions , 1986 .

[4]  Kenneth L. Clarkson,et al.  Nearest Neighbor Queries in Metric Spaces , 1997, STOC '97.

[5]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[6]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[7]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[8]  Sunil Arya,et al.  ANN: library for approximate nearest neighbor searching , 1998 .

[9]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[10]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[11]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[12]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[13]  Hector Garcia-Molina,et al.  Detecting Digital Copyright Violations On The Internet , 1999 .

[14]  D. Keim,et al.  What Is the Nearest Neighbor in High Dimensional Spaces? , 2000, VLDB.

[15]  P. Indyk Stable distributions, pseudorandom generators, embeddings and data stream computation , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[16]  Piotr Indyk,et al.  Scalable Techniques for Clustering the Web , 2000, WebDB.

[17]  Edith Cohen,et al.  Finding interesting associations without support pruning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[18]  Peter N. Yianilos,et al.  Locally lifting the curse of dimensionality for nearest neighbor search (extended abstract) , 2000, SODA '00.

[19]  Jeremy Buhler,et al.  Efficient large-scale sequence comparison by locality-sensitive hashing , 2001, Bioinform..

[20]  Edith Cohen,et al.  Finding Interesting Associations without Support Pruning , 2001, IEEE Trans. Knowl. Data Eng..

[21]  Cheng Yang MACS: music audio characteristic sequence indexing for similarity retrieval , 2001, Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575).

[22]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[23]  Sariel Har-Peled A replacement for Voronoi diagrams of near linear size , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[24]  Nasir D. Memon,et al.  Cluster-based delta compression of a collection of files , 2002, Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002..

[25]  Jeremy Buhler,et al.  Finding Motifs Using Random Projections , 2002, J. Comput. Biol..

[26]  Jeremy Buhler,et al.  Provably sensitive Indexing strategies for biosequence similarity search , 2002, RECOMB '02.

[27]  Piotr Indyk,et al.  Fast mining of massive tabular data via approximate distance computations , 2002, Proceedings 18th International Conference on Data Engineering.

[28]  David R. Karger,et al.  Finding nearest neighbors in growth-restricted metrics , 2002, STOC '02.

[29]  Ilan Shimshoni,et al.  Mean shift based clustering in high dimensions: a texture classification example , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[30]  Robert Krauthgamer,et al.  Navigating nets: simple algorithms for proximity search , 2004, SODA '04.

[31]  Piotr Indyk,et al.  Stable distributions, pseudorandom generators, embeddings, and data stream computation , 2006, JACM.