A comparison of selectivity estimators for range queries on metric attributes

In this paper, we present a comparison of nonparametric estimation methods for approximating the selectivities of queries, in particular range queries. In contrast to previous studies, our comparison focuses on metric attributes with large domains, which occur, for example, in spatial and temporal databases. We also assume that only small sample sets of the required relations are available for estimating the selectivity. In addition to the popular histogram estimators, our comparison includes so-called kernel estimation methods. Although these methods have been proven to be among the most accurate estimators known in statistics, they have not so far been considered for selectivity estimation of database queries. We first show how to generate kernel estimators that deliver accurate approximate selectivities of queries. We then show that two parameters, the number of samples and the so-called smoothing parameter, are important for the accuracy of both kernel estimators and histogram estimators. For histogram estimators, the smoothing parameter determines the number of bins (histogram classes). We present the optimal smoothing parameter as a function of the number of samples and show how to compute approximations of the optimal parameter. Moreover, we propose a new selectivity estimator that can be viewed as a hybrid of histogram and kernel estimators. Experimental results show the performance of the different estimators in practice. In our experiments, kernel estimators were most efficient for continuously distributed data sets, whereas for our real data sets the hybrid technique was most promising.
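To make the two estimator families concrete, the following is a minimal sketch of how a range-query selectivity could be estimated from a small sample. It is an illustration under stated assumptions, not the construction used in the paper: the Gaussian kernel, the rule-of-thumb bandwidth h = 1.06 σ n^(-1/5), the equi-width binning, and all function names are choices made here for the example. The smoothing parameter is the bandwidth `h` for the kernel estimator and the number of `bins` for the histogram.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal CDF, i.e. the integral of the Gaussian kernel."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rule_of_thumb_bandwidth(sample):
    """Normal-scale rule of thumb for a Gaussian kernel (an assumption
    of this sketch, not the paper's optimal smoothing parameter)."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-1.0 / 5.0)


def kernel_selectivity(sample, a, b, h):
    """Estimated fraction of tuples with a <= x <= b.

    Each sample point contributes the mass that its kernel, centred at
    the point and scaled by h, places on the query range [a, b]."""
    total = sum(
        normal_cdf((b - x) / h) - normal_cdf((a - x) / h) for x in sample
    )
    return total / len(sample)


def histogram_selectivity(sample, a, b, lo, hi, bins):
    """Equi-width histogram estimate over the domain [lo, hi].

    Assumes every sample value lies in [lo, hi] and that values are
    uniformly distributed inside each bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        # Clamp so that x == hi falls into the last bin.
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1

    selectivity = 0.0
    for i, count in enumerate(counts):
        left = lo + i * width
        right = left + width
        # Length of the intersection of the bin with the query range.
        overlap = max(0.0, min(b, right) - max(a, left))
        selectivity += (count / len(sample)) * (overlap / width)
    return selectivity
```

For a sample `s` drawn from an attribute with domain `[0, 1]`, calling `kernel_selectivity(s, 0.2, 0.4, rule_of_thumb_bandwidth(s))` and `histogram_selectivity(s, 0.2, 0.4, 0.0, 1.0, 10)` gives the two estimates side by side. Note that this simple kernel estimator places some mass outside the domain near its boundaries, which the sketch does not correct.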
