When Is "Nearest Neighbor" Meaningful?

We explore the effect of dimensionality on the "nearest neighbor" problem. We show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point. To provide a practical perspective, we present empirical results on both real and synthetic data sets that demonstrate that this effect can occur for as few as 10-15 dimensions. These results should not be interpreted to mean that high-dimensional indexing is never meaningful; we illustrate this point by identifying some high-dimensional workloads for which this effect does not occur. However, our results do emphasize that the methodology used almost universally in the database literature to evaluate high-dimensional indexing techniques is flawed and should be modified. In particular, most such techniques proposed in the literature are not evaluated against a simple linear scan, and are evaluated over workloads for which nearest neighbor is not meaningful. Often, even the reported experiments, when analyzed carefully, show that linear scan would outperform the techniques being proposed on the workloads studied at high (10-15) dimensionality!
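The central claim, that the nearest and farthest distances converge as dimensionality grows, can be illustrated with a small simulation. The sketch below is not the experimental setup of the paper; it assumes the simplest case (independent, uniformly distributed dimensions and Euclidean distance), and the function name `contrast_ratio` and its parameters are hypothetical. It reports the ratio of the farthest to the nearest distance from one random query point; a ratio close to 1 means the nearest neighbor is barely closer than any other point.

```python
import math
import random


def contrast_ratio(dim, n_points=500, seed=0):
    """Return (farthest distance) / (nearest distance) from one random query
    to n_points random points, all drawn uniformly from the unit cube [0, 1]^dim.

    Assumption: i.i.d. uniform dimensions and Euclidean distance, which is
    only one special case of the conditions discussed in the abstract.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return max(distances) / min(distances)


if __name__ == "__main__":
    # The ratio should shrink toward 1 as the dimensionality increases.
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim:5d}  farthest/nearest = {contrast_ratio(dim):.2f}")
```

Under these assumptions the ratio drops sharply between 2 and 100 dimensions and keeps approaching 1 after that. Workloads with strong cluster structure or low intrinsic dimensionality may behave differently, which is consistent with the caveat above that the effect does not occur for every high-dimensional workload.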
