Practical selectivity estimation through adaptive sampling

We recently proposed an adaptive random sampling algorithm for general query size estimation. Our earlier work analyzed the asymptotic efficiency and accuracy of the algorithm; in this paper we investigate its practicality as applied to selects and joins. First, we extend our previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy. Next, we provide “sanity bounds” to deal with queries for which the underlying data is extremely skewed or the query result is very small. Finally, we report on the performance of the estimation algorithm as implemented in a host language on a commercial relational system. The results are encouraging, even with this loose coupling between the estimation algorithm and the DBMS.
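To make the general idea concrete, the following is a minimal sketch of adaptive sampling for the selectivity of a select, written under assumptions that are not taken from the paper. It draws tuples at random until either enough matching tuples have been seen or a cap on the number of draws is reached; the cap plays the role of a sanity bound for very small results. The function name `estimate_selectivity`, the parameters `hit_target` and `max_samples`, and their default values are hypothetical, and the stopping rule is a simplification rather than the bounds derived in the paper.

```python
import random


def estimate_selectivity(rows, predicate, hit_target=100, max_samples=2000, rng=None):
    """Estimate the fraction of rows satisfying predicate by adaptive sampling.

    Illustrative sketch only: hit_target and max_samples are hypothetical
    knobs, not the constants analyzed in the paper.
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    if hit_target < 1 or max_samples < 1:
        raise ValueError("hit_target and max_samples must be at least 1")
    rng = rng or random.Random()

    hits = 0
    draws = 0
    # Keep sampling (with replacement) until enough matches accumulate,
    # or until the cap on draws (the "sanity bound") is reached.
    while hits < hit_target and draws < max_samples:
        row = rows[rng.randrange(len(rows))]
        draws += 1
        if predicate(row):
            hits += 1

    # Simple plug-in ratio; stopping on a hit count makes it slightly biased.
    return hits / draws, draws
```

When the predicate is selective, the loop stops at `max_samples` and returns a small (possibly zero) estimate instead of sampling indefinitely; when matches are common, it stops early once `hit_target` matches have been observed.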
