Selectivity and Cost Estimation for Joins Based on Random Sampling

We compare the performance of sampling-based procedures for estimating the selectivity of a join. While some of the procedures have been proposed in the database literature, their relative performance has never been analyzed. A main result of this paper is a partial ordering that compares the variability of the estimators for the different procedures after an arbitrary fixed number of sampling steps. Prior to the current work, it was also unknown whether these fixed-step procedures could be extended to fixed-precision procedures that are both asymptotically consistent and asymptotically efficient. Our second main result is a general method for such an extension and a proof that the method is valid for all the procedures under consideration. We show that, under plausible assumptions on sampling costs, the partial ordering of the fixed-step procedures with respect to variability of the selectivity estimator implies a partial ordering of the corresponding fixed-precision procedures with respect to sampling cost. Our final result is a collection of fixed-step and fixed-precision procedures for estimating the cost of processing a join query according to a fixed join plan.
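To make the distinction between fixed-step and fixed-precision procedures concrete, the following is a minimal sketch in Python. It is not one of the procedures analyzed in the paper: it assumes a single-attribute equijoin, relations held in memory as lists of join-key values, and independent uniform sampling of tuple pairs with replacement. The function names, the normal-approximation stopping rule, and the parameters `rel_error`, `z`, `min_steps`, and `max_steps` are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def exact_selectivity(r_keys, s_keys):
    """Exact selectivity |R join S| / (|R| * |S|) of an equijoin on one key."""
    s_counts = Counter(s_keys)
    join_size = sum(s_counts[k] for k in r_keys)  # missing keys count as 0
    return join_size / (len(r_keys) * len(s_keys))


def fixed_step_estimate(r_keys, s_keys, steps, rng):
    """Fixed-step sketch: sample `steps` tuple pairs, return the match fraction."""
    hits = 0
    for _ in range(steps):
        if rng.choice(r_keys) == rng.choice(s_keys):
            hits += 1
    return hits / steps


def fixed_precision_estimate(r_keys, s_keys, rel_error, rng,
                             z=1.96, min_steps=200, max_steps=1_000_000):
    """Fixed-precision sketch: keep sampling until the approximate confidence
    interval half-width is at most rel_error times the current estimate.
    The stopping rule is an assumed normal approximation, not the paper's."""
    hits = 0
    steps = 0
    while steps < max_steps:
        steps += 1
        if rng.choice(r_keys) == rng.choice(s_keys):
            hits += 1
        if steps >= min_steps and hits > 0:
            p = hits / steps
            half_width = z * math.sqrt(p * (1.0 - p) / steps)
            if half_width <= rel_error * p:
                break
    return hits / steps, steps


if __name__ == "__main__":
    rng = random.Random(42)
    r = [i % 10 for i in range(1000)]  # hypothetical relation R (join keys only)
    s = [i % 10 for i in range(500)]   # hypothetical relation S (join keys only)
    print("exact:", exact_selectivity(r, s))
    print("fixed-step:", fixed_step_estimate(r, s, 20_000, rng))
    print("fixed-precision:", fixed_precision_estimate(r, s, 0.1, rng))
```

In this sketch the fixed-step variant has a predictable sampling cost but no control over the error, while the fixed-precision variant targets a relative error and lets the number of sampling steps vary with the data.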
