Understanding cardinality estimation using entropy maximization

Cardinality estimation is the problem of estimating the number of tuples returned by a query. It is a fundamental task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of statistical assertions about the number of tuples returned by a fixed set of queries, predict the number of tuples returned by a new query. We model this problem using the probability space over possible worlds that satisfies all of the provided statistical assertions and, among all such spaces, maximizes entropy. We call this the Entropy Maximization model for statistics (MaxEnt). In this article we develop the mathematical techniques needed to use the MaxEnt model to predict the cardinality of conjunctive queries.
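
To make the framework concrete, here is a minimal sketch on a toy instance. It is not the technique developed in the article. The schema R(A, B), the two asserted statistics, and all function names are assumptions chosen for illustration. The sketch enumerates every possible world, assumes the standard exponential form for the maximum-entropy distribution under expected-cardinality constraints, fits the parameters by plain gradient descent on the dual, and then estimates the cardinality of a query that was not among the assertions.

```python
import itertools
import math

# Hypothetical toy schema: R(A, B) with A, B in {0, 1}, so 4 possible tuples.
TUPLES = [(a, b) for a in (0, 1) for b in (0, 1)]
# A possible world is any subset of the possible tuples (16 worlds here).
WORLDS = [frozenset(t for t, keep in zip(TUPLES, mask) if keep)
          for mask in itertools.product((0, 1), repeat=len(TUPLES))]

def count(pred, world):
    """Number of tuples in `world` that satisfy the selection predicate."""
    return sum(1 for t in world if pred(t))

# Assumed statistical assertions: (query predicate, expected cardinality).
STATS = [
    (lambda t: True, 2.0),       # E[|R|] = 2.0
    (lambda t: t[0] == 0, 1.5),  # E[|sigma_{A=0}(R)|] = 1.5
]

def distribution(theta):
    """Exponential-form distribution: p(I) proportional to exp(sum_j theta_j * |q_j(I)|)."""
    weights = [math.exp(sum(th * count(q, w) for th, (q, _) in zip(theta, STATS)))
               for w in WORLDS]
    z = sum(weights)  # normalizing constant
    return [wt / z for wt in weights]

def expected(pred, probs):
    """Expected cardinality of a selection query under `probs`."""
    return sum(p * count(pred, w) for p, w in zip(probs, WORLDS))

def fit(steps=2000, lr=0.5):
    """Gradient descent on the convex dual; the gradient is E[|q_j|] minus the asserted value."""
    theta = [0.0] * len(STATS)
    for _ in range(steps):
        probs = distribution(theta)
        theta = [th - lr * (expected(q, probs) - d)
                 for th, (q, d) in zip(theta, STATS)]
    return distribution(theta)

probs = fit()
# New query, not among the assertions: sigma_{B=0}(R).
estimate = expected(lambda t: t[1] == 0, probs)
print(round(estimate, 4))  # prints 1.0
```

In this toy case the fitted distribution is tuple-independent: tuples with A=0 appear with probability 0.75 and tuples with A=1 with probability 0.25, so the estimate for the new query is 0.75 + 0.25 = 1.0. Brute-force enumeration of worlds is exponential in the number of possible tuples and is used here only to keep the example small.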
