Histograms and Wavelets on Probabilistic Data

There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to process uncertain data in database systems correctly and efficiently. In particular, data reduction techniques that produce concise, accurate synopses of large probabilistic relations are crucial. Like their counterparts for deterministic relations, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients that optimizes the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of error metrics, we devise efficient algorithms that construct optimal or near-optimal histogram and wavelet synopses of size B. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple alternatives, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
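
To make the bucket-boundary problem concrete, the following is a minimal sketch of one natural formulation, not necessarily the exact definitions or algorithms of the paper. It assumes that each item of an ordered domain carries an independent discrete distribution over its frequency, and that a bucket is summarized by a single value b chosen to minimize the expected sum-squared error. Under these assumptions the expected error of a bucket is the sum of E[g_i^2] over its items minus (sum of E[g_i])^2 divided by the bucket width, so the standard deterministic dynamic program applies to prefix sums of the first and second moments. The names `moments` and `expected_sse_histogram` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed input model: one pdf per item, given as (frequency, probability) pairs.
Pdf = Sequence[Tuple[float, float]]


def moments(pdf: Pdf) -> Tuple[float, float]:
    """Return (E[g], E[g^2]) for one item's frequency distribution."""
    m1 = sum(v * p for v, p in pdf)
    m2 = sum(v * v * p for v, p in pdf)
    return m1, m2


def expected_sse_histogram(
    pdfs: Sequence[Pdf], B: int
) -> Tuple[float, List[Tuple[int, int, float]]]:
    """Return (minimum expected SSE, buckets) for a B-bucket histogram.

    Each bucket is (start, end, representative) over the half-open
    item range [start, end). Runs in O(B * n^2) time.
    """
    n = len(pdfs)
    if not 1 <= B <= n:
        raise ValueError("B must satisfy 1 <= B <= number of items")

    # Prefix sums of first and second moments give O(1) bucket costs.
    p1, p2 = [0.0], [0.0]
    for pdf in pdfs:
        m1, m2 = moments(pdf)
        p1.append(p1[-1] + m1)
        p2.append(p2[-1] + m2)

    def cost(s: int, e: int) -> float:
        # Expected SSE of items s..e-1 around their best single value.
        total = p1[e] - p1[s]
        return (p2[e] - p2[s]) - total * total / (e - s)

    inf = float("inf")
    # best[b][e]: minimum expected SSE covering the first e items with b buckets.
    best = [[inf] * (n + 1) for _ in range(B + 1)]
    back = [[0] * (n + 1) for _ in range(B + 1)]
    best[0][0] = 0.0
    for b in range(1, B + 1):
        for e in range(b, n + 1):
            for s in range(b - 1, e):
                c = best[b - 1][s] + cost(s, e)
                if c < best[b][e]:
                    best[b][e] = c
                    back[b][e] = s

    # Walk the back-pointers to recover the bucket boundaries.
    buckets: List[Tuple[int, int, float]] = []
    e = n
    for b in range(B, 0, -1):
        s = back[b][e]
        buckets.append((s, e, (p1[e] - p1[s]) / (e - s)))
        e = s
    buckets.reverse()
    return best[B][n], buckets
```

For example, calling `expected_sse_histogram(pdfs, 2)` on four items whose frequencies are certain (1, 1, 5, 5) splits them into two zero-error buckets. A single item whose frequency is 0 or 2 with equal probability has expected error equal to its variance, which no choice of boundaries can remove.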
