General Database Statistics Using Entropy Maximization

We propose a framework in which query sizes can be estimated from arbitrary statistical assertions on the data. In its most general form, a statistical assertion states that the size of the output of a conjunctive query over the data is a given number. A very simple example is a histogram, which makes assertions about the output sizes of several range queries. Our model also allows much more complex assertions that include joins and projections. To model such complex statistical assertions, we propose using the Entropy-Maximization (EM) probability distribution. In this model, any consistent set of statistics has a precise semantics, and every query has a precise size estimate. We show that several classes of statistics can be solved in closed form.
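As a rough illustration of the histogram case only, the following Python sketch estimates the size of a range query from bucket counts. It is a simplified stand-in, not the paper's model: it assumes that, with nothing known beyond the bucket counts, maximizing entropy spreads the tuples of each bucket uniformly over that bucket's value range. The function name `estimate_range_size`, the half-open integer bucket layout, and the sample counts are all hypothetical.

```python
from fractions import Fraction


def estimate_range_size(buckets, lo, hi):
    """Estimate how many tuples have a value in the half-open range [lo, hi).

    `buckets` is a list of (start, end, count) triples, each asserting that
    `count` tuples have a value in [start, end). Tuples are assumed to be
    spread uniformly inside a bucket (the assumption stated in the lead-in).
    """
    total = Fraction(0)
    for start, end, count in buckets:
        # Width of the part of this bucket that the query range covers.
        overlap = max(0, min(end, hi) - max(start, lo))
        # Each bucket contributes its count scaled by the covered fraction.
        total += Fraction(count * overlap, end - start)
    return total


# Hypothetical histogram: three range-query assertions over one attribute.
histogram = [(0, 10, 100), (10, 20, 40), (20, 40, 60)]

# Half of the first bucket, all of the second, a quarter of the third.
estimate = estimate_range_size(histogram, 5, 25)
print(estimate)  # 50 + 40 + 15 = 105
```

Exact fractions are used so that the estimate for a range aligned with the bucket boundaries reproduces the asserted counts exactly. Assertions involving joins or projections are not covered by this sketch.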
