On Approximate Algorithms for Distance-Based Queries using R-trees

In modern database applications the similarity or dissimilarity of complex objects is examined by performing distance-based queries (DBQs) on data of high dimensionality. The R-tree and its variations are commonly cited multidimensional access methods that can beused for answering such queries. Although the related algorithms work well for low-dimensional data spaces, their performance degrades as the number of dimensions increases (dimensionality curse). In order to obtain acceptable response time in high-dimensional data spaces, algorithms that obtain approximate solutions can be used. Approximation techniques, like N-consider (based on the tree structure), α-allowance and e-approximate (based on distance), or Time-consider (based on time) can be applied in branch-and-bound algorithms for DBQs inorder to control the trade-off between cost and accuracy of the result. In this paper, we improve previous approximate DBQ algorithms by applying a combination of the approximation techniques in the same query algorithm (hybrid approximation scheme). We investigate the performance of these improvements for one of the most representative DBQs (the K-closest pairs query, K-CPQ) in high-dimensional data spaces, as well as the influence of the algorithmic parameters on the control of the trade-off between the response time and the accuracy of the result. The outcome of the experimental evaluation, using synthetic and real datasets, is the derivation of the outperforming DBQ approximate algorithm for large high-dimensional point datasets.

[1]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[2]  Yannis Manolopoulos,et al.  Advanced Database Indexing , 1999, Advances in Database Systems.

[3]  Bernhard Seeger,et al.  GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces , 2001, KDD '01.

[4]  Pavel Zezula,et al.  Approximate similarity retrieval with M-trees , 1998, The VLDB Journal.

[5]  Hanan Samet,et al.  Distance browsing in spatial databases , 1999, TODS.

[6]  Sukho Lee,et al.  Adaptive multi-stage distance join processing , 2000, SIGMOD '00.

[7]  Christian Böhm,et al.  A cost model and index architecture for the similarity join , 2001, Proceedings 17th International Conference on Data Engineering.

[8]  Weixiong Zhang,et al.  Depth-First Branch-and-Bound versus Local Search: A Case Study , 2000, AAAI/IAAI.

[9]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[10]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[11]  Michael Vassilakopoulos,et al.  Approximate Algorithms for Distance-Based Queries in High-Dimensional Data Spaces Using R-Trees , 2002, ADBIS.

[12]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[13]  Marco Patella,et al.  PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[14]  Hanan Samet,et al.  Incremental distance join algorithms for spatial databases , 1998, SIGMOD '98.

[15]  Christian Böhm,et al.  High performance data mining using the nearest neighbor join , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[16]  Mark S. Boddy,et al.  An Analysis of Time-Dependent Planning , 1988, AAAI.

[17]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[18]  Ramesh C. Jain,et al.  Similarity indexing with the SS-tree , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[19]  Yannis Manolopoulos,et al.  Closest pair queries in spatial databases , 2000, SIGMOD '00.

[20]  Andreas Henrich,et al.  The LSD/sup h/-tree: an access structure for feature vectors , 1998, Proceedings 14th International Conference on Data Engineering.

[21]  Douglas Comer,et al.  Ubiquitous B-Tree , 1979, CSUR.

[22]  Masatoshi Yoshikawa,et al.  The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation , 2000, VLDB.

[23]  Oliver Günther,et al.  Multidimensional access methods , 1998, CSUR.

[24]  Michael Stonebraker,et al.  The Asilomar report on database research , 1998, SGMD.

[25]  Christos Faloutsos,et al.  Efficient and effective Querying by Image Content , 1994, Journal of Intelligent Information Systems.

[26]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[27]  Divyakant Agrawal,et al.  Approximate nearest neighbor searching in multimedia databases , 2001, Proceedings 17th International Conference on Data Engineering.

[28]  Michael Ian Shamos,et al.  Computational geometry: an introduction , 1985 .

[29]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[30]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[31]  Sharad Mehrotra,et al.  The hybrid tree: an index structure for high dimensional feature spaces , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[32]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[33]  Christos Faloutsos,et al.  Fast Nearest Neighbor Search in Medical Image Databases , 1996, VLDB.

[34]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[35]  Kenneth L. Clarkson,et al.  An algorithm for approximate closest-point queries , 1994, SCG '94.