Smooth Interpolating Histograms with Error Guarantees

Accurate selectivity estimates are essential for query optimization. They are typically derived from histograms of various kinds, which condense value distributions into compact representations. The accuracy of existing approaches typically varies across the domain: some estimates are very accurate, others quite inaccurate. This is particularly unfortunate when the estimates drive a parametric search, because estimation artifacts can then dominate the search results. We propose using linear splines to construct histograms with known error guarantees across the whole continuous domain. These histograms are particularly well suited to parameter optimization. A comprehensive performance evaluation on both synthetic and real-world data shows that our approach clearly outperforms existing techniques.
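To make the idea of a linear-spline histogram with an error bound concrete, the following is a minimal sketch and not the construction from the paper. It assumes the input is a list of strictly increasing values `xs` with cumulative counts `cum`, and it greedily picks knots so that linear interpolation between consecutive knots deviates from every sampled cumulative count by at most `eps`. The names `build_spline`, `estimate_cum`, and `eps` are hypothetical, and the bound is only checked at the sampled points, not across the whole continuous domain.

```python
from bisect import bisect_right


def _fits(xs, cum, i, j, eps):
    """True if the line from point i to point j stays within eps of all points between."""
    slope = (cum[j] - cum[i]) / (xs[j] - xs[i])
    return all(
        abs(cum[i] + slope * (xs[k] - xs[i]) - cum[k]) <= eps
        for k in range(i + 1, j)
    )


def build_spline(xs, cum, eps):
    """Greedy heuristic: extend each segment until the next point would break the bound.

    xs must be strictly increasing; cum[i] is the number of values <= xs[i].
    Returns the knots as (x, cumulative count) pairs.
    """
    n = len(xs)
    knots = [0]
    i = 0
    while i < n - 1:
        j = i + 1  # a segment between neighbours has no interior points
        while j + 1 < n and _fits(xs, cum, i, j + 1, eps):
            j += 1
        knots.append(j)
        i = j
    return [(xs[k], cum[k]) for k in knots]


def estimate_cum(spline, x):
    """Interpolated estimate of the number of values <= x."""
    kx = [p[0] for p in spline]
    if x < kx[0]:
        return 0.0
    if x >= kx[-1]:
        return float(spline[-1][1])
    s = bisect_right(kx, x) - 1
    (x0, c0), (x1, c1) = spline[s], spline[s + 1]
    return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
```

Under these assumptions, a range estimate for `(a, b]` would be `estimate_cum(spline, b) - estimate_cum(spline, a)`, whose error at sampled endpoints is at most `2 * eps`. The greedy scan is quadratic in the worst case and does not minimize the number of knots; it only illustrates how a uniform error bound can be enforced during construction.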
