Efficient Sampling Strategies for Relational Database Operations

Abstract: We recently proposed an adaptive random-sampling algorithm for general query size estimation in databases. In earlier work we analyzed the asymptotic efficiency and accuracy of the algorithm; in this paper we investigate its practicality as applied to the relational database operations select, project, and join. We extend our previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy. We also provide “sanity bounds” to deal with queries for which the underlying data are extremely skewed or the query result is very small. We investigate how existing indices can be used to build more efficient sampling algorithms for the project and join operations. Finally, we report on the performance of the estimation algorithm, both as implemented in stand-alone C programs and as implemented in a host language on a commercial relational system.
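As a rough illustration of the adaptive idea for a selection, the sketch below samples rows with replacement until either enough qualifying rows have been seen or a cap on the number of samples is reached, then scales the observed hit rate up to the table size. It is a minimal sketch under stated assumptions, not the algorithm as analyzed in the paper: the function name and the parameters `threshold` and `max_samples` are hypothetical, and the cap only stands in for the role of the “sanity bounds” without reproducing the paper's actual bounds.

```python
import random


def adaptive_selection_estimate(table, predicate, threshold, max_samples, rng):
    """Estimate how many rows of `table` satisfy `predicate`.

    `threshold` (hits needed before stopping) and `max_samples` (a cap that
    guards against very small or skewed results) are illustrative
    parameters, not values taken from the paper.
    """
    n = len(table)
    if n == 0 or max_samples < 1:
        raise ValueError("need a non-empty table and at least one sample")

    hits = 0
    samples = 0
    # Adaptive stopping rule: the sample size is not fixed in advance but
    # depends on how many qualifying rows have been observed so far.
    while hits < threshold and samples < max_samples:
        row = table[rng.randrange(n)]  # uniform sample with replacement
        samples += 1
        if predicate(row):
            hits += 1

    # Scale the observed fraction of qualifying rows to the full table.
    return n * hits / samples, samples


if __name__ == "__main__":
    rows = list(range(1000))
    estimate, used = adaptive_selection_estimate(
        rows, lambda x: x < 250, threshold=200, max_samples=5000,
        rng=random.Random(0),
    )
    print(f"estimated size: {estimate:.1f} after {used} samples")
```

When the predicate is highly selective, the loop ends at the cap rather than at the hit threshold, which is the situation the abstract's “sanity bounds” are meant to handle.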
