Histograms based on the minimum description length principle

Histograms have been widely used for selectivity estimation in query optimization, as well as for fast approximate query answering in many OLAP, data mining, and data visualization applications. This paper presents a new family of histograms, the Hierarchical Model Fitting (HMF) histograms, based on the Minimum Description Length principle. Rather than describing every bucket of a histogram with the same type of model, the HMF histograms employ a locally optimal model for each bucket. The improved effectiveness of the locally chosen models more than offsets the overhead of keeping track of the representation of each individual bucket. Through a set of experiments, we show that the HMF histograms are capable of providing more accurate approximations than previously proposed techniques for many real and synthetic data sets across a variety of query workloads.
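
The idea of choosing a model per bucket by description length can be illustrated with a minimal sketch. This is not the paper's algorithm: the two candidate models (constant and linear), the per-parameter cost `PARAM_BITS`, the model-tag cost `TAG_BITS`, and the Gaussian residual code are all assumptions made for illustration. The score is a relative code length (defined up to an additive constant, so it can be negative); only the comparison between candidates matters.

```python
import math

PARAM_BITS = 32  # assumed cost (bits) of storing one model parameter
TAG_BITS = 1     # assumed cost (bits) of recording which model a bucket uses


def fit_constant(freqs):
    """Fit a flat model: every value in the bucket gets the mean frequency."""
    mean = sum(freqs) / len(freqs)
    return [mean] * len(freqs), 1  # fitted values, number of parameters


def fit_linear(freqs):
    """Fit a least-squares line over positions 0..n-1 within the bucket."""
    n = len(freqs)
    x_mean = (n - 1) / 2
    y_mean = sum(freqs) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(freqs))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_mean - slope * x_mean
    return [intercept + slope * x for x in range(n)], 2


def description_length(freqs, fitted, n_params):
    """Two-part score: bits for the model plus bits for the residuals."""
    n = len(freqs)
    rss = sum((y - f) ** 2 for y, f in zip(freqs, fitted))
    # Gaussian residual code, up to an additive constant; the floor avoids log(0).
    data_bits = 0.5 * n * math.log2(max(rss / n, 1e-9))
    model_bits = TAG_BITS + n_params * PARAM_BITS
    return model_bits + data_bits


def choose_model(freqs):
    """Return the name of the candidate with the smallest score (non-empty bucket)."""
    candidates = {"constant": fit_constant, "linear": fit_linear}
    scores = {}
    for name, fit in candidates.items():
        fitted, n_params = fit(freqs)
        scores[name] = description_length(freqs, fitted, n_params)
    return min(scores, key=scores.get)


flat_bucket = [5, 5, 5, 5, 5, 5, 5, 5]
ramp_bucket = [0, 10, 20, 30, 40, 50, 60, 70]
print(choose_model(flat_bucket), choose_model(ramp_bucket))
```

In this sketch a flat bucket keeps the cheaper constant model, because the extra parameter of the line buys no reduction in residual cost, while a bucket with a clear trend pays for the second parameter and still comes out shorter overall. That trade-off is the sense in which the per-bucket bookkeeping overhead can be outweighed by a better local fit.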
