Paper Information - PAC Nearest Neighbor Queries: Using the Distance Distribution for Searching in High-Dimensional Metric Spaces

In this paper we introduce a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, which aims to break the "dimensionality curse" that prevents current approaches from being applied in high-dimensional spaces. PAC-NN queries return, with probability at least 1−δ, a (1+ε)-approximate NN, that is, an object whose distance from the query q is less than (1+ε) times the distance between q and its NN. We describe how the distance distribution of the query object can be used to determine a suitable stopping condition with probabilistic guarantees on the quality of the result, and then analyze the performance of both sequential and index-based PAC-NN algorithms. The analysis shows that PAC-NN queries can be processed efficiently even in very high-dimensional spaces, and that the user can control the trade-off between the accuracy of the result and the cost.
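The stopping condition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of an empirical sample to approximate the query's distance distribution F_q, and the assumption that the n database objects are drawn independently from that distribution are all illustrative choices. Under those assumptions the NN distance is at most r with probability 1 − (1 − F_q(r))^n, so a radius r_δ with that probability at most δ can be read off the sample, and a sequential scan may stop as soon as it has seen an object within (1+ε)·r_δ of the query.

```python
import math


def r_delta_from_sample(query, sample, dist, n, delta):
    """Estimate a radius r with P(NN distance <= r) <= delta.

    Assumption (illustrative): the n database objects are i.i.d., and the
    empirical distribution of dist(query, s) over `sample` approximates F_q.
    Then 1 - (1 - F_q(r))**n <= delta  <=>  F_q(r) <= 1 - (1 - delta)**(1/n).
    """
    p = 1.0 - (1.0 - delta) ** (1.0 / n)
    ds = sorted(dist(query, s) for s in sample)
    k = int(p * len(ds))  # number of sample distances with empirical F <= p
    # With no usable sample point, fall back to 0.0 (the scan then only
    # stops early on an exact match, i.e. it behaves like an exact search).
    return ds[k - 1] if k >= 1 else 0.0


def pac_nn_scan(query, data, dist, eps, r_delta):
    """Sequential scan that stops once the best distance is <= (1+eps)*r_delta.

    Returns (best_object, best_distance, number_of_objects_scanned).
    """
    best, best_d, scanned = None, math.inf, 0
    for obj in data:
        scanned += 1
        d = dist(query, obj)
        if d < best_d:
            best, best_d = obj, d
        if best_d <= (1.0 + eps) * r_delta:
            break  # early stop: result is accepted as good enough
    return best, best_d, scanned
```

Raising ε or δ enlarges the stopping threshold, so the scan tends to end earlier at the price of a looser guarantee; setting `r_delta` to 0 reduces the sketch to an ordinary exact scan.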

Marco Patella | Paolo Ciaccia
