PAC Nearest Neighbor Queries: Using the Distance Distribution for Searching in High-Dimensional Metric Spaces

In this paper we introduce a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, which aims to break the “dimensionality curse” that prevents current approaches from being applied in high-dimensional spaces. PAC-NN queries return, with probability at least 1 − δ, a (1 + ε)-approximate NN, that is, an object whose distance from the query q is less than (1 + ε) times the distance between q and its NN. We describe how the distance distribution of the query object can be used to determine a suitable stopping condition with probabilistic guarantees on the quality of the result, and then analyze the performance of both sequential and index-based PAC-NN algorithms. The analysis shows that PAC-NN queries can be processed efficiently even in very high-dimensional spaces, and that the trade-off between the accuracy of the result and the cost can be controlled.
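The stopping condition can be illustrated with a minimal sketch of a sequential PAC-NN scan. The abstract does not give the algorithm in detail, so everything below is an assumption made for illustration: the names `pac_r_delta`, `pac_nn_scan`, and `inv_F` are hypothetical, the query's distance distribution F is assumed to be available through its quantile function `inv_F`, and the n objects are assumed to be independent draws, so that the NN distance is at most r with probability G(r) = 1 − (1 − F(r))^n. The radius r_δ is chosen so that G(r_δ) = δ, and the scan stops as soon as it finds an object within (1 + ε) · r_δ of the query.

```python
import math

def pac_r_delta(inv_F, n, delta):
    """Radius r_delta with G(r_delta) = delta, where G(r) = 1 - (1 - F(r))**n.

    inv_F is the (assumed known) quantile function of the query's
    distance distribution F; n is the number of objects in the data set.
    """
    # Solve 1 - (1 - F(r))**n = delta for F(r), then invert F.
    p = 1.0 - (1.0 - delta) ** (1.0 / n)
    return inv_F(p)

def pac_nn_scan(data, query, dist, r_delta, eps):
    """Sequential scan with early stopping (illustrative sketch).

    Stops at the first object whose distance to the query is at most
    (1 + eps) * r_delta. If no such object is found, the whole data set
    is scanned and the exact NN is returned.
    """
    best, best_d = None, math.inf
    for obj in data:
        d = dist(query, obj)
        if d < best_d:
            best, best_d = obj, d
            if best_d <= (1.0 + eps) * r_delta:
                break  # good enough, given the probabilistic guarantee
    return best, best_d
```

Under these assumptions the true NN lies farther than r_δ with probability 1 − δ, so an object found within (1 + ε) · r_δ is a (1 + ε)-approximate NN with at least that probability. Raising δ or ε enlarges the stopping radius and shortens the scan, which is the accuracy versus cost trade-off described above.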
