Approximate nearest neighbors: towards removing the curse of dimensionality

We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R d , the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors' STOC'98 and FOCS'01 papers. It unifies, generalizes and simplifies the results from those papers.

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  H. Hotelling Analysis of a complex of statistical variables into principal components. , 1933 .

[3]  Kari Karhunen,et al.  Über lineare Methoden in der Wahrscheinlichkeitsrechnung , 1947 .

[4]  William Feller,et al.  An Introduction to Probability Theory and Its Applications , 1951 .

[5]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[6]  Terry A. Welch,et al.  BOUNDS ON INFORMATION RETRIEVAL EFFICIENCY IN STATIC FILE STRUCTURES. , 1971 .

[7]  Walter A. Burkhard,et al.  Some approaches to best-match file searching , 1973, Commun. ACM.

[8]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[9]  Peter E. Hart,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[10]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[11]  Michael Ian Shamos,et al.  Closest-point problems , 1975, 16th Annual Symposium on Foundations of Computer Science (sfcs 1975).

[12]  Richard J. Lipton,et al.  Multidimensional Searching Problems , 1976, SIAM J. Comput..

[13]  Robert E. Tarjan,et al.  Applications of a planar separator theorem , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[14]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[15]  Nimrod Megiddo,et al.  Applying parallel computation algorithms in the design of serial algorithms , 1981, 22nd Annual Symposium on Foundations of Computer Science (sfcs 1981).

[16]  János Komlós,et al.  Storing a sparse table with O(1) worst case access time , 1982, 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982).

[17]  L. Devroye,et al.  8 Nearest neighbor methods in discrimination , 1982, Classification, Pattern Recognition and Reduction of Dimensionality.

[18]  William B. Johnson,et al.  Embeddinglpm intol1n , 1982 .

[19]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[20]  W. B. Johnson,et al.  Extensions of Lipschitz mappings into Hilbert space , 1984 .

[21]  Andrew Chi-Chih Yao,et al.  A general approach to d-dimensional geometric queries , 1985, STOC '85.

[22]  Editors , 1986, Brain Research Bulletin.

[23]  Herbert Edelsbrunner,et al.  Algorithms in Combinatorial Geometry , 1987, EATCS Monographs in Theoretical Computer Science.

[24]  Peter Frankl,et al.  The Johnson-Lindenstrauss lemma and the sphericity of some graphs , 1987, J. Comb. Theory B.

[25]  Kenneth L. Clarkson,et al.  A Randomized Algorithm for Closest-Point Queries , 1988, SIAM J. Comput..

[26]  Hanan Samet,et al.  The Design and Analysis of Spatial Data Structures , 1989 .

[27]  Sanguthevar Rajasekaran,et al.  The light bulb problem , 1995, COLT '89.

[28]  G. Pisier The volume of convex bodies and Banach space geometry , 1989 .

[29]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[30]  Otfried Cheong,et al.  Euclidean minimum spanning trees and bichromatic closest pairs , 1990, SCG '90.

[31]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[32]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[33]  Jeffrey K. Uhlmann,et al.  Satisfying General Proximity/Similarity Queries with Metric Trees , 1991, Inf. Process. Lett..

[34]  Ricardo Baeza-Yates,et al.  Information Retrieval: Data Structures and Algorithms , 1992 .

[35]  Jirí Matousek,et al.  Reporting Points in Halfspaces , 1992, Comput. Geom..

[36]  Jirí Matousek,et al.  Ray shooting and parametric search , 1992, STOC '92.

[37]  S. Meiser,et al.  Point Location in Arrangements of Hyperplanes , 1993, Inf. Comput..

[38]  Danny Dolev,et al.  Finding the neighborhood of a query in a dictionary , 1993, [1993] The 2nd Israel Symposium on Theory and Computing Systems.

[39]  Isidore Rigoutsos,et al.  FLASH: a fast look-up algorithm for string homology , 1993, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Peter N. Yianilos,et al.  Data structures and algorithms for nearest neighbor search in general metric spaces , 1993, SODA '93.

[41]  Marshall W. Bern,et al.  Approximate Closest-Point Queries in High Dimensions , 1993, Inf. Process. Lett..

[42]  Sunil Arya,et al.  Approximate nearest neighbor queries in fixed dimensions , 1993, SODA '93.

[43]  F. Frances Yao,et al.  Multi-index hashing for information retrieval , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[44]  N. Linial,et al.  The geometry of graphs and some of its algorithmic applications , 1994, FOCS.

[45]  Noam Nisan,et al.  Neighborhood preserving hashing and approximate queries , 1994, SODA '94.

[46]  Kenneth L. Clarkson,et al.  An algorithm for approximate closest-point queries , 1994, SCG '94.

[47]  Sunil Arya,et al.  Accounting for boundary effects in nearest neighbor searching , 1995, SCG '95.

[48]  S. Rao Kosaraju,et al.  A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields , 1995, JACM.

[49]  Michael W. Berry,et al.  A Case Study of Latent Semantic Indexing , 1995 .

[50]  Rajeev Motwani,et al.  Randomized Algorithms , 1995, SIGA.

[51]  Alex Pentland,et al.  Photobook: tools for content-based manipulation of image databases , 1994, Other Conferences.

[52]  Chris Buckley,et al.  New Retrieval Approaches Using SMART: TREC 4 , 1995, TREC.

[53]  Geoffrey Zweig,et al.  The bit vector intersection problem , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[54]  D. Vernon Inform , 1995, Encyclopedia of the UN Sustainable Development Goals.

[55]  Dragutin Petkovic,et al.  Query by Image and Video Content: The QBIC System , 1995, Computer.

[56]  Robert Tibshirani,et al.  Discriminant Adaptive Nearest Neighbor Classification , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[57]  Arne AnderssonLund Universityarne Static Dictionaries on Ac 0 Rams: Query Time ( R Log N= Log Log N) Is Necessary and Suucient , 1996 .

[58]  A. Andersson Static Dictionaries on RAMs: Query Time is Necessary and Sufficient , 1996, FOCS 1996.

[59]  Christos Faloutsos,et al.  Multidimensional Access Methods: Trees Have Grown Everywhere , 1997, VLDB.

[60]  Kenneth L. Clarkson,et al.  Nearest Neighbor Queries in Metric Spaces , 1997, STOC '97.

[61]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[62]  Timothy M. Chan Approximate Nearest Neighbor Queries Revisited , 1997, SCG '97.

[63]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[64]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[65]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[66]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[67]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[68]  Frédéric Cazals,et al.  Effective nearest neighbors searching on the hyper-cube, with applications to molecular clustering , 1998, SCG '98.

[69]  David Eppstein,et al.  Fast hierarchical clustering and other applications of dynamic closest pairs , 1999, SODA '98.

[70]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[71]  Arnold W. M. Smeulders,et al.  Image Databases and Multi-Media Search , 1998, Image Databases and Multi-Media Search.

[72]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[73]  Piotr Indyk,et al.  Geometric matching under noise: combinatorial bounds and algorithms , 1999, SODA '99.

[74]  Allan Borodin,et al.  Subquadratic approximation algorithms for clustering problems in high dimensional spaces , 1999, STOC '99.

[75]  S. Muthukrishnan,et al.  Approximate nearest neighbors and sequence comparison with block operations , 2000, STOC '00.

[76]  Piotr Indyk,et al.  Scalable Techniques for Clustering the Web , 2000, WebDB.