Improved histograms for selectivity estimation of range predicates

Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspects, the available choices for each aspect, and the impact of such choices on histogram effectiveness. In this paper, we provide a taxonomy of histograms that captures all previously proposed histogram types and indicates many new possibilities. We introduce novel choices for several of the taxonomy dimensions, and derive new histogram types by combining choices in effective ways. We also show how sampling techniques can be used to reduce the cost of histogram construction. Finally, we present results from an empirical study of the proposed histogram types used in selectivity estimation of range predicates and identify the histogram types that have the best overall performance.
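
As a rough illustration of the setting described above, the following sketch builds an equi-depth histogram (one of the previously proposed types) from a reservoir sample and uses it to estimate the selectivity of a range predicate. It is not one of the new histogram types introduced in the paper. The function names (`reservoir_sample`, `build_equi_depth`, `estimate_selectivity`) are hypothetical, and the sketch assumes numeric attribute values spread uniformly within each bucket.

```python
import random


def reservoir_sample(values, k, seed=0):
    """Draw a uniform random sample of up to k items in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, v in enumerate(values):
        if i < k:
            sample.append(v)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = v
    return sample


def build_equi_depth(values, num_buckets):
    """Return (lo, hi, count) buckets holding roughly equal numbers of values."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(num_buckets):
        start = b * n // num_buckets
        end = (b + 1) * n // num_buckets
        if start < end:  # skip empty buckets when n < num_buckets
            buckets.append((data[start], data[end - 1], end - start))
    return buckets


def estimate_selectivity(buckets, a, b):
    """Estimate the fraction of values v with a <= v <= b.

    Assumption: values are spread uniformly between a bucket's lo and hi.
    """
    total = sum(count for _, _, count in buckets)
    matched = 0.0
    for lo, hi, count in buckets:
        if hi < a or lo > b:
            continue  # bucket lies entirely outside the range
        if hi == lo:
            matched += count  # single-valued bucket inside the range
        else:
            overlap = min(hi, b) - max(lo, a)
            matched += count * overlap / (hi - lo)
    return matched / total


if __name__ == "__main__":
    column = list(range(10000))             # stand-in for an attribute column
    sample = reservoir_sample(column, 500)  # cheaper than sorting the full column
    hist = build_equi_depth(sample, 10)
    print(estimate_selectivity(hist, 2000, 5999))  # true selectivity is 0.4
```

Building the histogram from the sample rather than the full column trades some accuracy for a much smaller sort, which is the kind of cost reduction the abstract refers to. The accuracy achieved depends on the sample size and on the data distribution.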
