On Using Correlation-based Synopses during Query Optimization

In recent years, a variety of complex synopsis structures have been proposed to capture statistical correlations present in database relations. The goal of that work has been to improve cardinality estimation during query optimization by improving statistics on "base tables". In order to use the correlation information for multi-table estimations as well, synopses must also be constructed on "intermediate result tables" during query optimization. We address that problem here, and show how to efficiently integrate multi-dimensional histograms and other complex synopsis structures into the search algorithm of a traditional query optimizer such as that of System R. A main challenge in this work is minimizing the computational expense of dynamically computing synopses on intermediate tables during query optimization. We show that one only needs to construct a very small subset of all possible synopses to get the full estimation benefits of multi-dimensional modeling. We also show that choosing this subset unwisely can have unacceptable performance overheads. This leads to an "estimation planning" problem that has been ignored to date in literature: how to choose the most efficiently-computable collection of intermediate synopses to construct, so that all the required cardinality estimates can be computed as accurately as possible from the base-table synopses? In this paper, we identify and formalize the estimation planning problem in the context of a System R-style optimizer extended with complex synopsis techniques such as multi-dimensional histograms. We propose algorithms that find very good estimation plans efficiently, and discuss properties of synopsis structures that make them amenable to efficient use in an optimizer.

[1]  Stavros Christodoulakis,et al.  Estimating block transfers and join sizes , 1983, SIGMOD '83.

[2]  Kevin D. Seppi,et al.  A Bayesian Approach to Database Query Optimization , 1993, INFORMS J. Comput..

[3]  Guy M. Lohman,et al.  Measuring the Complexity of Join Enumeration in Query Optimization , 1990, VLDB.

[4]  Surajit Chaudhuri,et al.  Self-tuning histograms: building histograms without looking at data , 1999, SIGMOD '99.

[5]  Jeffrey Scott Vitter,et al.  Dynamic Maintenance of Wavelet-Based Histograms , 2000, VLDB.

[6]  G. Antoshenkov,et al.  Dynamic query optimization in Rdb/VMS , 1993, Proceedings of IEEE 9th International Conference on Data Engineering.

[7]  Yannis E. Ioannidis,et al.  Randomized algorithms for optimizing large join queries , 1990, SIGMOD '90.

[8]  Goetz Graefe The Cascades Framework for Query Optimization , 1995, IEEE Data Eng. Bull..

[9]  Kyuseok Shim,et al.  Approximate query processing using wavelets , 2001, The VLDB Journal.

[10]  Dan E. Willard,et al.  Optimal sample cost residues for differential database batch query problems , 1991, JACM.

[11]  Gregory Piatetsky-Shapiro,et al.  Accurate estimation of the number of tuples satisfying a condition , 1984, SIGMOD '84.

[12]  David J. DeWitt,et al.  Equi-depth multidimensional histograms , 1988, SIGMOD '88.

[13]  Rajeev Rastogi,et al.  Independence is good: dependency-based histogram synopses for high-dimensional data , 2001, SIGMOD '01.

[14]  Ben Taskar,et al.  Selectivity estimation using probabilistic models , 2001, SIGMOD '01.

[15]  Jeffrey Scott Vitter,et al.  Data cube approximation and histograms via wavelets , 1998, CIKM '98.

[16]  Michael V. Mannino,et al.  Statistical profile estimation in database systems , 1988, CSUR.

[17]  Richard L. Cole A Decision Theoretic Cost Model for Dynamic Plans , 2000, IEEE Data Eng. Bull..

[18]  Yannis E. Ioannidis,et al.  Selectivity Estimation Without the Attribute Value Independence Assumption , 1997, VLDB.

[19]  Carlo Zaniolo,et al.  Optimization of Nonrecursive Queries , 1986, VLDB.

[20]  Phillip B. Gibbons,et al.  Approximate Query Processing: Taming the TeraBytes! A Tutorial , 2001 .

[21]  Peter J. Haas,et al.  The New Jersey Data Reduction Report , 1997 .

[22]  Luis Gravano,et al.  STHoles: a multidimensional workload-aware histogram , 2001, SIGMOD '01.

[23]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.