Consistent Histograms In The Presence of Distinct Value Counts

Self-tuning histograms have been proposed as a way to leverage feedback from query execution. However, the focus thus far has been on histograms that store only cardinalities. In this paper, we study consistent histogram construction from query feedback that also takes distinct value counts into account. We first show how the entropy maximization (EM) principle can be leveraged to identify a distribution that approximates the data given the execution feedback while making the fewest additional assumptions. This EM model takes both distinct value counts and cardinalities into account. However, we find that it is computationally prohibitively expensive. We thus consider an alternative formulation for consistency: for a given query workload, the goal is to minimize the L2 distance between the true and estimated cardinalities. This approach also handles both cardinalities and distinct value counts. We propose an efficient one-pass algorithm that models this formulation and has several theoretical properties. Our experiments show that this approach produces improvements in accuracy similar to those of the EM-based approach while being computationally significantly more efficient.
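To make the L2 formulation concrete, the sketch below refines histogram bucket counts from query feedback in a single pass. It is not the algorithm proposed in the paper: it is a generic Kaczmarz-style projection, and it handles cardinalities only, not distinct value counts. The names `refine`, `squared_error`, and `Feedback`, as well as the assumption that each query covers whole buckets listed by distinct index, are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical feedback record: (indices of the buckets a query covers,
# cardinality observed when the query was executed).
Feedback = Tuple[Sequence[int], float]


def squared_error(counts: Sequence[float], feedback: Sequence[Feedback]) -> float:
    """Squared L2 distance between observed and estimated cardinalities."""
    return sum(
        (sum(counts[i] for i in buckets) - observed) ** 2
        for buckets, observed in feedback
    )


def refine(counts: Sequence[float], feedback: Sequence[Feedback]) -> List[float]:
    """One pass over the feedback, correcting one record at a time.

    Each step spreads the estimation error evenly over the covered buckets.
    For a single record this is the smallest change (in L2 norm) to the
    counts that makes the estimate match the observation. Counts are not
    clamped at zero in this sketch.
    """
    refined = list(counts)
    for buckets, observed in feedback:
        if not buckets:
            continue  # a query covering no buckets carries no information
        estimate = sum(refined[i] for i in buckets)
        correction = (observed - estimate) / len(buckets)
        for i in buckets:
            refined[i] += correction
    return refined


# Assumed starting point: 100 rows spread uniformly over four buckets.
initial = [25.0, 25.0, 25.0, 25.0]
# Two executed queries over disjoint bucket ranges.
feedback = [((0, 1), 70.0), ((2, 3), 30.0)]
refined = refine(initial, feedback)
```

Because the two queries in this example cover disjoint buckets, one pass satisfies both observations exactly. With overlapping queries, a later correction can disturb an earlier one, so a single pass of this simple scheme generally only reduces the error rather than eliminating it.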
