Approximating multi-dimensional aggregate range queries over real attributes

Finding approximate answers to multi-dimensional range queries over real-valued attributes has significant applications in data exploration and database query optimization. In this paper we consider the following problem: given a table of d attributes whose domains are the real numbers, and a query that specifies a range in each dimension, find a good approximation of the number of records in the table that satisfy the query. We present a new histogram technique designed to approximate the density of multi-dimensional datasets with real attributes. Our technique finds buckets of variable size and allows them to overlap, which permits a more efficient approximation of the density. The size of the cells is based on the local density of the data. The technique yields a faster and more compact approximation of the data distribution. We also show how to generalize kernel density estimators and how to apply them to the multi-dimensional query approximation problem. Finally, we compare the accuracy of the proposed techniques with that of existing techniques using real and synthetic datasets.
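
To make the problem statement concrete, the following is a minimal sketch of range-count estimation with a kernel density estimator built from a random sample. It is an illustration under stated assumptions, not the generalized estimator of the paper: the product Epanechnikov kernel, the fixed per-dimension bandwidths, the sample size, and all function names (`exact_count`, `epanechnikov_cdf`, `kde_count`) are choices made here for illustration only.

```python
import random


def exact_count(rows, lo, hi):
    """Scan every record and count those inside the query box [lo, hi]."""
    return sum(
        all(l <= x <= h for x, l, h in zip(row, lo, hi)) for row in rows
    )


def epanechnikov_cdf(u):
    """Integral of K(t) = 0.75 * (1 - t^2) from -1 to u, clamped to [0, 1]."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def kde_count(sample, total_rows, bandwidth, lo, hi):
    """Estimate the number of records in the box [lo, hi].

    Each sample point spreads one unit of mass with a product kernel;
    the mass falling inside the box is integrated in closed form,
    one dimension at a time, and scaled up to the full table size.
    """
    mass = 0.0
    for point in sample:
        inside = 1.0
        for x, h, l, u in zip(point, bandwidth, lo, hi):
            inside *= epanechnikov_cdf((u - x) / h) - epanechnikov_cdf((l - x) / h)
        mass += inside
    return total_rows * mass / len(sample)


# Hypothetical data: 20,000 records with d = 2 real-valued attributes.
random.seed(0)
rows = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(20000)]
sample = random.sample(rows, 500)   # the estimator only sees this sample
bandwidth = (0.3, 0.3)              # assumed, not tuned

lo, hi = (-1.0, -0.5), (0.5, 1.0)   # one range per dimension
truth = exact_count(rows, lo, hi)
estimate = kde_count(sample, len(rows), bandwidth, lo, hi)
print(f"exact = {truth}, estimated = {estimate:.0f}")
```

The estimator stores only the sample and the bandwidths, so its cost is independent of the table size. Its accuracy depends on the sample size and on how well the bandwidths match the local density of the data.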
