Selectivity estimation using probabilistic models

Estimating the result size of complex queries that involve selection on multiple attributes and the join of several relations is a difficult but fundamental task in database query processing. It arises in cost-based query optimization, query profiling, and approximate query answering. In this paper, we show how probabilistic graphical models can be effectively used for this task as an accurate and compact approximation of the joint frequency distribution of multiple attributes across multiple relations. Probabilistic Relational Models (PRMs) are a recent development that extends graphical statistical models such as Bayesian Networks to relational domains. They represent the statistical dependencies between attributes within a table, and between attributes across foreign-key joins. We provide an efficient algorithm for constructing a PRM from a database, and show how a PRM can be used to compute selectivity estimates for a broad class of queries. One of the major contributions of this work is a unified framework for the estimation of queries involving both select and foreign-key join operations. Furthermore, our approach is not limited to answering a small set of predetermined queries; a single model can be used to effectively estimate the sizes of a wide collection of potential queries across multiple tables. We present results for our approach on several real-world databases. For both single-table multi-attribute queries and a general class of select-join queries, our approach produces more accurate estimates than standard approaches to selectivity estimation, using comparable space and time.
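
The single-table idea can be illustrated with a minimal sketch. It is not the paper's PRM construction algorithm: the table, the attribute names, and the chain structure education -> income -> owns_home are all hypothetical and fixed by hand, whereas the paper learns the structure from the database. The sketch only contrasts an estimate made under the attribute value independence assumption with one obtained by factoring the joint distribution along a small Bayesian network.

```python
from collections import Counter

# Hypothetical toy table; each row is (education, income, owns_home).
ROWS = (
    [("hs", "low", "no")] * 40
    + [("hs", "low", "yes")] * 10
    + [("hs", "high", "no")] * 2
    + [("hs", "high", "yes")] * 8
    + [("college", "low", "no")] * 8
    + [("college", "low", "yes")] * 2
    + [("college", "high", "no")] * 6
    + [("college", "high", "yes")] * 24
)
EDU, INCOME, HOME = 0, 1, 2  # column positions


def marginal(rows, col):
    """Return P(col = v) for every value v seen in the column."""
    counts = Counter(row[col] for row in rows)
    return {value: n / len(rows) for value, n in counts.items()}


def conditional(rows, child, parent):
    """Return P(child = c | parent = p) as a nested dict keyed by p, then c."""
    pair_counts = Counter((row[parent], row[child]) for row in rows)
    parent_counts = Counter(row[parent] for row in rows)
    table = {}
    for (p, c), n in pair_counts.items():
        table.setdefault(p, {})[c] = n / parent_counts[p]
    return table


def true_selectivity(rows, edu, home):
    """Exact fraction of rows with education = edu AND owns_home = home."""
    hits = sum(1 for row in rows if row[EDU] == edu and row[HOME] == home)
    return hits / len(rows)


def independence_estimate(rows, edu, home):
    """Estimate assuming the two attributes are independent."""
    return marginal(rows, EDU).get(edu, 0.0) * marginal(rows, HOME).get(home, 0.0)


def chain_network_estimate(rows, edu, home):
    """Estimate from the assumed chain education -> income -> owns_home.

    P(edu, home) = P(edu) * sum over income of P(income | edu) * P(home | income)
    """
    p_edu = marginal(rows, EDU).get(edu, 0.0)
    p_income_given_edu = conditional(rows, INCOME, EDU).get(edu, {})
    p_home_given_income = conditional(rows, HOME, INCOME)
    total = 0.0
    for income, p_income in p_income_given_edu.items():
        total += p_income * p_home_given_income[income].get(home, 0.0)
    return p_edu * total


if __name__ == "__main__":
    query = ("college", "yes")
    print(f"true:         {true_selectivity(ROWS, *query):.3f}")
    print(f"independence: {independence_estimate(ROWS, *query):.3f}")
    print(f"chain model:  {chain_network_estimate(ROWS, *query):.3f}")
```

In this constructed data, home ownership depends on education only through income, so the chain model reproduces the true selectivity while the independence assumption underestimates it. The model stores one marginal and two small conditional tables rather than the full joint distribution, which is where the compactness mentioned in the abstract comes from.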
