The Efficiency of Histogram-like Techniques for Database Query Optimization

One of the most difficult tasks in modern database management systems is information retrieval. This task involves a user query, written in a high-level language such as the Structured Query Language (SQL), and a set of internal operations that are transparent to the user. The internal operations are carried out by very complex modules that decompose, optimize and execute the query. We consider the problem of query optimization, in which the system chooses the most economical plan among many different query evaluation plans (QEPs). Since the number of QEPs increases exponentially with the number of relations involved in the query, query optimization is a very complex problem. Many estimation techniques have been developed to approximate the cost of a QEP, and histogram-based techniques are the most widely used in this context. In this paper, we discuss the efficiency of four of these methods: Equi-width, Equi-depth, the Rectangular Attribute Cardinality Map (R-ACM) and the Trapezoidal Attribute Cardinality Map (T-ACM). These methods are used to estimate the costs of the different QEPs and, from those estimates, to determine the optimal one. It has been shown that the errors of the estimates from R-ACM and T-ACM are significantly smaller than the corresponding errors obtained from Equi-width and Equi-depth. This fact has been formally demonstrated using reasonable statistical distributions for the cost of a QEP, namely the doubly exponential distribution and the normal distribution. For the empirical analysis, we have developed a formal, rigorous prototype model used to analyze these methods on random databases. Our empirical results demonstrate that R-ACM chooses a superior QEP more than twice as often as Equi-width and Equi-depth. Similar results have been obtained for T-ACM when compared to the traditional methods.
Indeed, in the most general scenario, we analytically prove that, under certain models, the better the accuracy of an estimation technique, the greater the probability of choosing the most efficient QEP.
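To make the two traditional methods concrete, the following is a minimal Python sketch of how Equi-width and Equi-depth histograms might be built and used to estimate the result size of a range predicate. It is an illustration under stated assumptions, not the prototype described above: the function names (`equi_width`, `equi_depth`, `estimate_range`), the toy data, and the uniform-spread assumption inside each bucket are all hypothetical choices made here, and R-ACM and T-ACM are not implemented because their construction is not described in this abstract.

```python
def equi_width(values, num_buckets):
    """Buckets span equal value ranges. Returns a list of (lo, hi, count).

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


def equi_depth(values, num_buckets):
    """Buckets hold (roughly) equal numbers of tuples. Returns (lo, hi, count)."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(num_buckets):
        chunk = ordered[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


def estimate_range(buckets, a, b):
    """Estimate how many tuples satisfy a <= value <= b.

    Assumption: tuples are spread uniformly (continuously) within a bucket.
    """
    total = 0.0
    for lo, hi, count in buckets:
        if hi == lo:
            # Single-valued bucket: it is either fully inside or fully outside.
            if a <= lo <= b:
                total += count
            continue
        overlap = min(b, hi) - max(a, lo)
        if overlap > 0:
            total += count * overlap / (hi - lo)
    return total


# Hypothetical skewed attribute: a few very frequent values and a long tail.
values = [1] * 50 + [2] * 20 + [3] * 10 + list(range(4, 24))
true_count = sum(1 for v in values if 1 <= v <= 2)

ew_buckets = equi_width(values, 4)
ed_buckets = equi_depth(values, 4)
ew_estimate = estimate_range(ew_buckets, 1, 2)
ed_estimate = estimate_range(ed_buckets, 1, 2)

print("true count:", true_count)
print("equi-width estimate:", ew_estimate)
print("equi-depth estimate:", ed_estimate)
```

An optimizer would feed estimates like these into its cost model for each candidate QEP, so a large estimation error on a skewed attribute can change which plan appears cheapest. The gap between the two estimates on this toy data reflects only this particular distribution and predicate.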
