Cardinality estimation using sample views with quality assurance

Accurate cardinality estimation is critically important to high-quality query optimization. It is well known that conventional cardinality estimation based on histograms or similar statistics may produce extremely poor estimates in a variety of situations, for example, queries with complex predicates, correlation among columns, or predicates containing user-defined functions. In this paper, we propose a new, general cardinality estimation technique that combines random sampling and materialized view technology to produce accurate estimates even in these situations. As a major innovation, we exploit feedback information from query execution and process control techniques to assure that estimates remain statistically valid when the underlying data changes. Experimental results based on a prototype implementation in Microsoft SQL Server demonstrate the practicality of the approach and illustrate the dramatic effects improved cardinality estimates may have.

[1]  Rajeev Motwani,et al.  On random sampling over joins , 1999, SIGMOD '99.

[2]  J. Pedoe,et al.  Sequential Methods in Statistics , 1966 .

[3]  A. Chao Nonparametric estimation of the number of classes in a population , 1984 .

[4]  Rajeev Motwani,et al.  Random sampling for histogram construction: how much is enough? , 1998, SIGMOD '98.

[5]  Rajeev Motwani,et al.  Towards estimation error guarantees for distinct values , 2000, PODS.

[6]  Peter J. Haas,et al.  Sequential sampling procedures for query size estimation , 1992, SIGMOD '92.

[7]  Volker Markl,et al.  Learning table access cardinalities with LEO , 2002, SIGMOD '02.

[8]  F. Olken,et al.  Maintenance of materialized views of sampling queries , 1992, [1992] Eighth International Conference on Data Engineering.

[9]  J. Keith Ord,et al.  The poisson-inverse gaussian disiribuiion as a model for species abundance , 1986 .

[10]  L. Brown,et al.  Interval Estimation for a Binomial Proportion , 2001 .

[11]  J. Bunge,et al.  Estimating the Number of Species: A Review , 1993 .

[12]  Irene A. Stegun,et al.  Handbook of Mathematical Functions. , 1966 .

[13]  A. Chao,et al.  Estimating the Number of Classes via Sample Coverage , 1992 .

[14]  Jeff Sauro,et al.  Estimating Completion Rates from Small Samples Using Binomial Confidence Intervals: Comparisons and Recommendations , 2005 .

[15]  Peter J. Haas,et al.  A dip in the reservoir: maintaining sample synopses of evolving datasets , 2006, VLDB.

[16]  Sridhar Ramaswamy,et al.  The Aqua approximate query answering system , 1999, SIGMOD '99.

[17]  Daniel Zelterman,et al.  Robust estimation in truncated discrete distributions with application to capture-recapture experiments , 1988 .

[18]  Jeffrey Scott Vitter,et al.  Wavelet-based histograms for selectivity estimation , 1998, SIGMOD '98.

[19]  Paul Brown,et al.  CORDS: automatic discovery of correlations and soft functional dependencies , 2004, SIGMOD '04.

[20]  Jeffrey F. Naughton,et al.  Sampling-Based Estimation of the Number of Distinct Values of an Attribute , 1995, VLDB.

[21]  César A. Galindo-Legaria,et al.  Statistics on Views , 2003, VLDB.

[22]  Peter J. Haas,et al.  A bi-level Bernoulli scheme for database sampling , 2004, SIGMOD '04.

[23]  P. Haas,et al.  Estimating the Number of Classes in a Finite Population , 1998 .

[24]  Doron Rotem,et al.  Random Sampling from Database Files: A Survey , 1990, SSDBM.

[25]  Sridhar Ramaswamy,et al.  Join synopses for approximate query answering , 1999, SIGMOD '99.

[26]  Peter J. Haas,et al.  Large-sample and deterministic confidence intervals for online aggregation , 1997, Proceedings. Ninth International Conference on Scientific and Statistical Database Management (Cat. No.97TB100150).

[27]  E. B. Wilson Probable Inference, the Law of Succession, and Statistical Inference , 1927 .

[28]  Hamid Pirahesh,et al.  Robust query processing through progressive optimization , 2004, SIGMOD '04.