Two algorithms for nearest-neighbor search in high dimensions

Representing data as points in a high-dimensional space, so as to use geometric methods for indexing, is an algorithmic technique with a wide array of uses. It is central to a number of areas such as information retrieval, pattern recognition, and statistical data analysis; many of the problems arising in these applications can involve several hundred or several thousand dimensions. We consider the nearest-neighbor problem for d-dimensional Euclidean space: we wish to pre-process a database of n points so that, given a query point, one can efficiently determine its nearest neighbors in the database. There is a large literature on algorithms for this problem, in both the exact and approximate cases. The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the average-case analysis of heuristics such as k-d trees reveals an exponential dependence on d in the query time. In this work, we develop a new approach to the nearest-neighbor problem, based on a method for combining randomly chosen one-dimensional projections of the underlying point set. From this, we obtain the following two results. (i) An algorithm for finding ε-approximate nearest neighbors with a query time of O((d log d)(d + log n)). (ii) An ε-approximate nearest-neighbor algorithm with near-linear storage and a query time that improves asymptotically on linear search in all dimensions.

∗Department of Computer Science, Cornell University, Ithaca NY 14853. Email: kleinber@cs.cornell.edu. This work was performed in large part while on leave at the IBM Almaden Research Center, San Jose CA 95120. The author is currently supported by an Alfred P. Sloan Research Fellowship and by NSF Faculty Early Career Development Award CCR-9701399.
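The general idea of comparing distances through randomly chosen one-dimensional projections can be illustrated with a small sketch. This is not the algorithm of the paper and does not achieve its query-time bounds; it is a plain linear scan that only shows how a majority vote over random projections can stand in for a direct distance comparison. All function names (`random_directions`, `project`, `closer_by_vote`, `approx_nearest`) and the choice of Gaussian directions are assumptions made for this illustration.

```python
import random


def random_directions(d, k, rng):
    """Draw k random directions in R^d with i.i.d. standard Gaussian coordinates."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def project(point, directions):
    """Return the one-dimensional projections of `point` onto each direction."""
    return [sum(v_i * x_i for v_i, x_i in zip(v, point)) for v in directions]


def closer_by_vote(proj_x, proj_y, proj_q):
    """True if x is at least as close to q as y in at least half of the projections."""
    wins = sum(
        abs(px - pq) <= abs(py - pq)
        for px, py, pq in zip(proj_x, proj_y, proj_q)
    )
    return 2 * wins >= len(proj_q)


def approx_nearest(points, query, directions):
    """Return the index of a likely nearest neighbor of `query` among `points`.

    Illustrative assumption: a simple left-to-right scan that keeps the current
    winner of the pairwise votes. Real data structures would precompute and
    index the projections instead of scanning.
    """
    projections = [project(p, directions) for p in points]  # pre-processing step
    proj_q = project(query, directions)
    best = 0
    for i in range(1, len(points)):
        if not closer_by_vote(projections[best], projections[i], proj_q):
            best = i
    return best
```

When one point is much closer to the query than another, it also tends to be closer in most random projections, so the vote agrees with the true comparison with high probability; when two distances are nearly equal, the vote may go either way, which is why such methods return approximate rather than exact neighbors.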
