Understanding cardinality estimation using entropy maximization

Cardinality estimation is the problem of estimating the number of tuples returned by a query. It is a fundamental task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of statistical assertions about the number of tuples returned by a fixed set of queries, predict the number of tuples returned by a new query. We model this problem using the probability space over possible worlds that satisfies all of the provided statistical assertions and, among all such spaces, maximizes entropy. We call this the Entropy Maximization model for statistics (MaxEnt). In this paper we develop the mathematical techniques needed to use the MaxEnt model to predict the cardinality of conjunctive queries.
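The MaxEnt idea can be illustrated on a toy instance. The following is a minimal brute-force sketch, not the techniques developed in the paper. It assumes a single statistical assertion of the form "the expected number of tuples in R is d" and a hypothetical six-tuple domain. It relies on the standard fact that the entropy-maximizing distribution under one expectation constraint gives each world I a probability proportional to alpha^|I|. All names (`maxent_distribution`, `expected_cardinality`, the domain values) are illustrative assumptions. Enumerating all possible worlds is only feasible for very small domains.

```python
from itertools import combinations, product

def maxent_distribution(domain, expected_size):
    """Brute-force MaxEnt distribution over all subsets (possible worlds) of
    `domain`, subject to the single assumed constraint E[|I|] = expected_size.
    Requires 0 < expected_size < len(domain)."""
    worlds = [frozenset(c) for k in range(len(domain) + 1)
              for c in combinations(domain, k)]

    def mean_size(alpha):
        # Under one expectation constraint, P(I) is proportional to alpha^|I|.
        weights = [alpha ** len(w) for w in worlds]
        return sum(len(w) * wt for w, wt in zip(worlds, weights)) / sum(weights)

    # mean_size is increasing in alpha, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while mean_size(hi) < expected_size:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mean_size(mid) < expected_size:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2.0

    weights = [alpha ** len(w) for w in worlds]
    z = sum(weights)
    return {w: wt / z for w, wt in zip(worlds, weights)}

def expected_cardinality(dist, predicate):
    """Expected number of tuples satisfying `predicate` under `dist`."""
    return sum(p * sum(1 for t in world if predicate(t))
               for world, p in dist.items())

# Hypothetical relation R(A, B) over a 2 x 3 domain, asserted to hold 2 tuples
# in expectation; the new query is the selection A = 'a1'.
domain = list(product(["a1", "a2"], ["b1", "b2", "b3"]))
dist = maxent_distribution(domain, expected_size=2.0)
estimate = expected_cardinality(dist, lambda t: t[0] == "a1")
print(round(estimate, 6))
```

With only a size constraint, the MaxEnt distribution treats every candidate tuple symmetrically, so each is present with probability 2/6. Three of the six candidate tuples satisfy A = 'a1', so the printed estimate should be close to 1.0.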
