Density-based indexing for approximate nearest-neighbor queries

We consider the problem of performing nearest-neighbor queries efficiently over large high-dimensional databases. Assuming that a full database scan to determine the nearest neighbor entries is not acceptable, we study the possibility of constructing an index structure over the database. It is well-accepted that traditional database indexing algorithms fail for high-dimensional data (say d > 10 or 20 depending on the scheme). Some arguments have advocated that nearest-neighbor queries do not even make sense for high-dimensional data since the ratio of maximum and minimum distance goes to 1 as dimensionality increases. We show that these arguments are based on over-restrictive assumptions, and that in the general case it is meaningful and possible to perform such queries. We present an approach for deriving a multidimensional index to support approximate nearest-neighbor queries over large databases. Our approach, called DBIN, scales to high-dimensional databases by exploiting statistical properties of the data. The approach is based on statistically modeling the density of the content of the data table. DBIN uses the density model to derive a single index over the data table and requires physically re-writing data in a new table sorted by the newly created index (i.e. create what is known as a clustered-index in the database literature). The indexing scheme produces a mapping between a query point (a data record) and an ordering on the clustered index values. Data is then scanned according to the index until the probability that the nearest-neighbor has been found exceeds some threshold. We present theoretical and empirical justification for DBIN. The scheme supports a family of distance functions which includes the traditional Euclidean distance measure.
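The scan-by-index idea described above can be illustrated with a minimal sketch. It is not the DBIN algorithm itself and rests on two loudly stated assumptions: a plain k-means partition stands in for the density model, and a deterministic triangle-inequality prune stands in for the probabilistic stopping threshold (so this sketch returns the exact nearest neighbor rather than an approximate one). All function names (`dist`, `kmeans`, `build_index`, `query`) are hypothetical and chosen only for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; an assumed stand-in for the paper's density model."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(col) / len(g) for col in zip(*g))
    return centers


def build_index(points, k, seed=0):
    """Group records by cluster, mimicking a table re-written in index order."""
    centers = kmeans(points, k, seed=seed)
    buckets = [[] for _ in range(k)]
    for p in points:
        j = min(range(k), key=lambda c: dist(p, centers[c]))
        buckets[j].append(p)
    # Radius of each cluster: largest distance from its center to a member.
    radii = [max((dist(p, centers[j]) for p in b), default=0.0)
             for j, b in enumerate(buckets)]
    return centers, buckets, radii


def query(index, q):
    """Scan clusters in order of center distance, skipping hopeless ones."""
    centers, buckets, radii = index
    order = sorted(range(len(centers)), key=lambda j: dist(q, centers[j]))
    best, best_d, scanned = None, math.inf, 0
    for j in order:
        # Triangle inequality: no member of cluster j can be closer to q
        # than dist(q, center_j) - radius_j, so skip if that cannot win.
        if dist(q, centers[j]) - radii[j] >= best_d:
            continue
        for p in buckets[j]:
            scanned += 1
            d = dist(q, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d, scanned


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic 8-dimensional data drawn around three well-separated means.
    data = [tuple(rng.gauss(c, 1.0) for _ in range(8))
            for c in (0.0, 10.0, 20.0) for _ in range(100)]
    index = build_index(data, k=3)
    q = tuple(rng.gauss(10.0, 1.0) for _ in range(8))
    neighbor, d, scanned = query(index, q)
    print(f"distance={d:.3f}, records scanned={scanned} of {len(data)}")
```

On clustered data the scan typically touches only the clusters near the query. Replacing the deterministic prune with a probability estimate derived from a fitted density model is what would make the search approximate, as the abstract describes; that step is deliberately left out here because its details are not given in this section.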
Microsoft Research Technical Report MSR-TR-98-58
Revised: February 28, 1999

Contact Author: Usama Fayyad (http://research.microsoft.com/~fayyad)
Address: Microsoft Research, One Microsoft Way, Redmond, WA 98008, USA
Phone: +1-425-703-1528
Fax: +1-425-936-7329
E-mail: fayyad@microsoft.com

This work was performed while the author was visiting Microsoft Research.
This work was performed while the author was on sabbatical at Microsoft Research.
