Consistently Estimating the Selectivity of Conjuncts of Predicates

Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics (MVS) to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method efficiently exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Our implementation using a prototype version of DB2 UDB shows that ME improves the optimizer's cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times.

[1]  C. E. SHANNON,et al.  A mathematical theory of communication , 1948, MOCO.

[2]  J. Darroch,et al.  Generalized Iterative Scaling for Log-Linear Models , 1972 .

[3]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[4]  Stavros Christodoulakis,et al.  Estimating record selectivities , 1983, Inf. Syst..

[5]  Gregory Piatetsky-Shapiro,et al.  Accurate estimation of the number of tuples satisfying a condition , 1984, SIGMOD '84.

[6]  Silviu Guiasu,et al.  The principle of maximum entropy , 1985 .

[7]  Clifford A. Lynch,et al.  Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values , 1988, VLDB.

[8]  Stavros Christodoulakis,et al.  On the propagation of errors in the size of join results , 1991, SIGMOD '91.

[9]  Andrew K. C. Wong,et al.  Information Discovery through Hierarchical Maximum Entropy Discretization and Synthesis , 1991, Knowledge Discovery in Databases.

[10]  Allen Van Gelder Multiple join size estimation by virtual domains (extended abstract) , 1993, PODS '93.

[11]  Allen Van Gelder,et al.  Multiple Join Size Estimation by Virtual Domains. , 1993, PODS 1993.

[12]  Arun N. Swami,et al.  On the Estimation of Join Result Sizes , 1994, EDBT.

[13]  P.J. Haas,et al.  Sampling-based selectivity estimation for joins using augmented frequent value statistics , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[14]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[15]  Yannis E. Ioannidis,et al.  Selectivity Estimation Without the Attribute Value Independence Assumption , 1997, VLDB.

[16]  Raghu Ramakrishnan,et al.  Probabilistic Optimization of Top N Queries , 1999, VLDB.

[17]  Surajit Chaudhuri,et al.  Self-tuning histograms: building histograms without looking at data , 1999, SIGMOD '99.

[18]  Warren R. Greiff,et al.  The maximum entropy approach and probabilistic IR models , 2000, TOIS.

[19]  Volker Markl,et al.  LEO - DB2's LEarning Optimizer , 2001, VLDB.

[20]  Ben Taskar,et al.  Selectivity estimation using probabilistic models , 2001, SIGMOD '01.

[21]  Rajeev Rastogi,et al.  Independence is good: dependency-based histogram synopses for high-dimensional data , 2001, SIGMOD '01.

[22]  Surajit Chaudhuri,et al.  Automating Statistics Management for Query Optimizers , 2001, IEEE Trans. Knowl. Data Eng..

[23]  Hermann Ney,et al.  Refined Lexikon Models for Statistical Machine Translation Using a Maximum Entropy Approach , 2001, ACL.

[24]  Luis Gravano,et al.  STHoles: a multidimensional workload-aware histogram , 2001, SIGMOD '01.

[25]  Surajit Chaudhuri,et al.  Exploiting statistics on query expressions for optimization , 2002, SIGMOD '02.

[26]  Surajit Chaudhuri,et al.  Efficient creation of statistics over query expressions , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[27]  César A. Galindo-Legaria,et al.  Statistics on Views , 2003, VLDB.

[28]  Donald K. Burleson,et al.  Oracle Database 10g New Features: Oracle10g Reference for Advanced Tuning and Administration , 2003 .

[29]  Peter J. Haas,et al.  Automated Statistics Collection in DB2 UDB , 2004, VLDB.

[30]  Paul Brown,et al.  CORDS: automatic discovery of correlations and soft functional dependencies , 2004, SIGMOD '04.

[31]  Surajit Chaudhuri,et al.  Conditional selectivity for statistics on query expressions , 2004, SIGMOD '04.

[32]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .