Synopses for probabilistic data over large domains

Many real-world applications produce uncertain data drawn from measurements over a continuous domain. Recent research on probabilistic databases has focused mainly on managing and querying discrete data whose domain is limited to a small number of values (on the order of 10). As the domain grows, current methods fail because they explicitly store each value/probability pair, and so they cannot be extended to continuous-valued attributes. In this paper, we provide a scalable, accurate, space-efficient probabilistic data synopsis for uncertain attributes defined over a continuous domain. Our synopsis construction methods are all error-aware, ensuring that the synopsis accurately represents the underlying data within a limited space budget. Additionally, we provide approximate query results over the synopsis with error bounds. An extensive experimental evaluation shows that our proposed methods improve upon the current state of the art in construction time and query accuracy. In particular, our synopsis can be constructed in O(N^2) time, where N is the number of tuples in the database. We also demonstrate that our synopsis can answer a variety of interesting queries on a real data set, and show that our query error is reduced by up to an order of magnitude relative to the previous state-of-the-art method.
