Top-k selection queries over relational databases: Mapping strategies and performance evaluation

In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result of such a query is typically a ranking of the "top k" tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that a traditional relational database management system (RDBMS) can process efficiently. In particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to an RDBMS, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme. We also report the first experimental evaluation of the mapping strategies over a real RDBMS, namely Microsoft's SQL Server 7.0. The experiments show that our new techniques are robust and significantly more efficient than previously known strategies, which require at least one sequential scan of the data sets.
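The idea of answering a top-k query through a range query can be illustrated with a minimal sketch. It is not the paper's algorithm: the table name `pts`, its columns `x` and `y`, the helper `make_demo_db`, the max-norm distance, and the doubling restart policy are all assumptions made for illustration. In particular, the initial `radius` is simply passed in here, whereas the abstract describes deriving the range from statistics the RDBMS maintains.

```python
import sqlite3


def make_demo_db(points):
    """Build an in-memory table pts(x, y) (hypothetical schema for this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pts (x REAL, y REAL)")
    conn.executemany("INSERT INTO pts VALUES (?, ?)", points)
    return conn


def top_k_via_range(conn, target, k, radius, grow=2.0):
    """Return the k rows closest to `target` under the max-norm distance.

    The query is mapped to a box-shaped range query of half-width `radius`.
    Under the max norm, the box holds exactly the tuples within distance
    `radius`, so once it contains at least k tuples it contains the top k.
    If it holds fewer, the query is restarted with a larger box.
    """
    if radius <= 0 or grow <= 1:
        raise ValueError("radius must be positive and grow must exceed 1")
    qx, qy = target
    total = conn.execute("SELECT COUNT(*) FROM pts").fetchone()[0]
    k = min(k, total)  # cannot return more tuples than the table holds
    while True:
        rows = conn.execute(
            "SELECT x, y FROM pts "
            "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
            (qx - radius, qx + radius, qy - radius, qy + radius),
        ).fetchall()
        if len(rows) >= k:
            # Rank only the candidates returned by the range query.
            rows.sort(key=lambda r: max(abs(r[0] - qx), abs(r[1] - qy)))
            return rows[:k]
        radius *= grow  # too few candidates: restart with a wider range
```

A range that is too small forces a restart, while one that is too large retrieves many tuples that are later discarded; this is the trade-off that makes the quality of the statistics used to pick the range matter.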
