Efficient Aggregation Methods for Probabilistic Data Streams

In this paper, we consider aggregation algorithms for the SUM operator in uncertain stream processing. Deterministic algorithms cannot be used here because the data are uncertain, change at high rates, and must be processed under time and memory constraints. We compare the most promising available methods. Instead of the full distribution function of the query result, we describe each distribution with a set of six parameters based on key moments and quantiles. This enables fast recomputation of the aggregate with O(1) complexity. Experimental results demonstrate good performance of uncertain aggregation in comparison with the deterministic case. We also found that use of the central limit theorem may be restricted to problems where the data satisfy certain conditions.
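
As a rough illustration of why a moment-based summary allows constant-time recomputation, the sketch below maintains the mean and variance of SUM over a count-based sliding window. It is not the method of the paper: it tracks only two of the six parameters, it assumes the stream items are independent (so means and variances add), and the class name `WindowSum` and its interface are hypothetical.

```python
from collections import deque


class WindowSum:
    """Mean and variance of SUM over the last `size` uncertain items.

    Assumption: items are independent random variables, so the mean and
    variance of their sum are the sums of the item means and variances.
    """

    def __init__(self, size):
        self.size = size
        self.items = deque()  # (mean, variance) of each item in the window
        self.mean = 0.0
        self.var = 0.0

    def push(self, item_mean, item_var):
        """Add one item, evict the oldest if the window is full: O(1)."""
        self.items.append((item_mean, item_var))
        self.mean += item_mean
        self.var += item_var
        if len(self.items) > self.size:
            old_mean, old_var = self.items.popleft()
            self.mean -= old_mean
            self.var -= old_var
        return self.mean, self.var
```

Each update touches only the arriving and the expiring item, so its cost does not depend on the window size. With floating-point inputs, repeated addition and subtraction can accumulate rounding error, so a long-running version would likely need periodic recomputation from the stored items.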
