Wavelet synopses with error guarantees

Recent work has demonstrated the effectiveness of the wavelet decomposition in reducing large amounts of data to compact sets of wavelet coefficients (termed "wavelet synopses") that can be used to provide fast and reasonably accurate approximate answers to queries. A major criticism of such techniques is that, unlike random sampling, for example, conventional wavelet synopses do not provide informative error guarantees on the accuracy of individual approximate answers. In fact, as this paper demonstrates, errors can vary widely (without bound) and unpredictably, even for identical queries on nearly identical values in distinct parts of the data. This lack of error guarantees severely limits the practicality of traditional wavelets as an approximate query-processing tool, because users cannot judge the quality of any particular approximate answer. In this paper, we introduce Probabilistic Wavelet Synopses, the first wavelet-based data reduction technique with guarantees on the accuracy of individual approximate answers. Whereas earlier approaches rely on deterministic thresholding to select a set of "good" wavelet coefficients, our technique is based on a novel probabilistic thresholding scheme that assigns each coefficient a probability of being retained, based on its importance to the reconstruction of individual data values, and then flips coins to select the synopsis. We show how our scheme avoids the above pitfalls of deterministic thresholding, providing highly accurate answers for individual data values in a data vector. We propose several novel optimization algorithms for tuning our probabilistic thresholding scheme to minimize desired error metrics. Experimental results on real-world and synthetic data sets evaluate these algorithms and demonstrate the effectiveness of our probabilistic wavelet synopses in providing fast, highly accurate answers with error guarantees.
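The coin-flipping idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: an unnormalized Haar transform is used, each retention probability is simply set proportional to the coefficient's magnitude (a hypothetical stand-in for the optimization algorithms the paper proposes), and a retained coefficient is stored as c / y so that its expected value equals c, in the style of randomized rounding. The function names are illustrative only.

```python
import random


def haar_decompose(data):
    """Unnormalized Haar transform of a list whose length is a power of two.

    Returns [overall average, detail coefficients from coarsest to finest].
    """
    coeffs = []
    current = list(data)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coeffs = details + coeffs  # finer levels end up at the back
        current = averages
    return current + coeffs


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        expanded = []
        for avg, d in zip(current, details):
            expanded.extend([avg + d, avg - d])
        current = expanded
    return current


def probabilistic_threshold(coeffs, budget, rng):
    """Keep coefficient c with probability y and store it as c / y.

    ASSUMPTION: y = min(1, budget * |c| / sum|c|) is a hypothetical choice made
    for this sketch; the expected synopsis size is then at most `budget`.
    """
    total = sum(abs(c) for c in coeffs)
    synopsis = {}
    for index, c in enumerate(coeffs):
        if c == 0:
            continue  # zero coefficients carry no information
        y = min(1.0, budget * abs(c) / total)
        if rng.random() < y:
            synopsis[index] = c / y  # rescaling makes the expectation equal c
    return synopsis


def approximate_data(synopsis, length):
    """Rebuild an approximate data vector; dropped coefficients count as 0."""
    coeffs = [synopsis.get(i, 0.0) for i in range(length)]
    return haar_reconstruct(coeffs)


if __name__ == "__main__":
    data = [2, 2, 0, 2, 3, 5, 4, 4]
    coeffs = haar_decompose(data)
    synopsis = probabilistic_threshold(coeffs, budget=3, rng=random.Random(0))
    print(synopsis)
    print(approximate_data(synopsis, len(data)))
```

Because every retained coefficient is scaled by 1 / y, each reconstructed value is an unbiased estimate of the original under these assumptions; how the probabilities are tuned then determines the variance, which is the role the abstract assigns to its optimization algorithms.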

[1]  Prabhakar Raghavan,et al.  Randomized rounding: A technique for provably good algorithms and algorithmic proofs , 1985, Comb..

[2]  X. M. Yu,et al.  Probabilistic discrete wavelet approximation , 1992 .

[3]  Peter J. Haas,et al.  Sequential sampling procedures for query size estimation , 1992, SIGMOD '92.

[4]  X. M. Yu,et al.  Monotone and probabilistic wavelet approximation , 1992 .

[5]  Wim Sweldens,et al.  An Overview of Wavelet Based Multiresolution Analyses , 1994, SIAM Rev..

[6]  David Salesin,et al.  Wavelets for computer graphics: a primer. 2 , 1995, IEEE Computer Graphics and Applications.

[7]  David Salesin,et al.  Wavelets for computer graphics - theory and applications , 1996, The Morgan Kaufmann series in computer graphics and geometric modeling.

[8]  Peter Schröder,et al.  Wavelets in computer graphics , 1996, Proc. IEEE.

[9]  R. DeVore,et al.  Nonlinear approximation , 1998, Acta Numerica.

[10]  Jeffrey Scott Vitter,et al.  Wavelet-based histograms for selectivity estimation , 1998, SIGMOD '98.

[11]  Jeffrey Scott Vitter,et al.  Approximate computation of multidimensional aggregates of sparse data using wavelets , 1999, SIGMOD '99.

[12]  Sridhar Ramaswamy,et al.  Join synopses for approximate query answering , 1999, SIGMOD '99.

[13]  Russ Bubley,et al.  Randomized algorithms , 1995, CSUR.

[14]  Jeffrey Scott Vitter,et al.  Dynamic Maintenance of Wavelet-Based Histograms , 2000, VLDB.

[15]  Kyuseok Shim,et al.  Approximate query processing using wavelets , 2001, The VLDB Journal.

[16]  S. Muthukrishnan,et al.  Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries , 2001, VLDB.