A cost model for similarity queries in metric spaces

We consider the problem of estimating CPU (distance computations) and I/O costs for processing range and k-nearest neighbors queries over metric spaces. Unlike the specific case of vector spaces, where information on data distribution has been exploited to derive cost models for predicting the performance of multi-dimensional access methods, in a generic metric space there is no such a possibility, which makes the problem quite different and requires a novel approach. We insist that the distance distribution of objects can be profitably used to solve the problem, and consequently develop a concrete cost model for the M-tree access method [10]. Our results rely on the assumption that the indexed dataset comes from a metric space which is “homogeneous” enough (in a probabilistic sense) to allow reliable cost estimations even if the distance distribution with respect to a specific query object is unknown. We experimentally validate the model over both real and synthetic datasets, and show how the model can be used to tune the M-tree in order to minimize a combination of CPU and I/O costs. Finally, we sketch how the same approach can be applied to derive a cost model for the vp-tree index structure [8].

[1]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[2]  Daniel P. Huttenlocher,et al.  Comparing Images Using the Hausdorff Distance , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Christian Böhm,et al.  A cost model for nearest neighbor search in high-dimensional data space , 1997, PODS.

[4]  Yannis Manolopoulos,et al.  Performance of Nearest Neighbor Queries in R-Trees , 1997, ICDT.

[5]  Christos Faloutsos,et al.  Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension , 1995, VLDB.

[6]  Christos Faloutsos,et al.  On packing R-trees , 1993, CIKM '93.

[7]  Jeffrey F. Naughton,et al.  Generalized Search Trees for Database Systems , 1995, VLDB.

[8]  A. Guttmma,et al.  R-trees: a dynamic index structure for spatial searching , 1984 .

[9]  Pavel Zezula,et al.  Processing Complex Similarity Queries with Distance-Based Access Methods , 1998, EDBT.

[10]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[11]  Sergey Brin,et al.  Near Neighbor Search in Large Metric Spaces , 1995, VLDB.

[12]  Timos K. Sellis,et al.  A model for the prediction of R-tree performance , 1996, PODS.

[13]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[14]  A. Guttman,et al.  A Dynamic Index Structure for Spatial Searching , 1984, SIGMOD 1984.

[15]  Tzi-cker Chiueh,et al.  Content-Based Image Indexing , 1994, VLDB.

[16]  Benoit B. Mandelbrot,et al.  Fractal Geometry of Nature , 1984 .

[17]  Nasser Yazdani,et al.  Matching and indexing sequences of different lengths , 1997, CIKM '97.

[18]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[19]  B. Mandelbrot Fractal Geometry of Nature , 1984 .

[20]  Karl Aberer,et al.  Efficient querying on genomic databases by using metric space indexing techniques , 1997, Database and Expert Systems Applications. 8th International Conference, DEXA '97. Proceedings.

[21]  Christos Faloutsos,et al.  Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension , 1994, PODS.

[22]  Dario Maio,et al.  A structural approach to fingerprint classification , 1996, Proceedings of 13th International Conference on Pattern Recognition.