Efficient Threshold Monitoring for Distributed Probabilistic Data

In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, commonly known as the distributed threshold monitoring problem [4], [16], [19], [29]. This work extends this fundamental study to distributed probabilistic data, which emerge in many applications where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, and geographically separate scientific institutes or units. When dealing with probabilistic data, two thresholds are involved: the score threshold and the probability threshold. Both must be monitored simultaneously, so techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. In an extensive experimental evaluation using several large real datasets, our algorithms significantly outperform the baseline method in both communication cost (number of messages and bytes) and running time.
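
To make the two-threshold constraint concrete, the following is a minimal sketch, not the algorithms studied in this work. It assumes each site holds an independent discrete random score, that the monitored aggregate is the sum of the site scores, and that an alarm is due when the probability of the sum exceeding the score threshold is itself above the probability threshold. The names `sum_distribution`, `threshold_violated`, `gamma`, and `delta` are hypothetical and chosen only for illustration. The sketch also evaluates the condition centrally by brute force, which is the kind of costly approach that communication-efficient methods aim to avoid.

```python
from collections import defaultdict


def sum_distribution(site_pdfs):
    """Distribution of the sum of independent discrete scores.

    Each element of site_pdfs is a dict {score: probability} for one site
    (an assumed data model, used here only for illustration).
    """
    total = {0: 1.0}
    for pdf in site_pdfs:
        nxt = defaultdict(float)
        for s, ps in total.items():
            for v, pv in pdf.items():
                # Independence assumption: joint probability is the product.
                nxt[s + v] += ps * pv
        total = dict(nxt)
    return total


def threshold_violated(site_pdfs, gamma, delta):
    """Return (alarm, p_exceed), where p_exceed = Pr[sum of scores > gamma].

    gamma is the score threshold and delta is the probability threshold;
    the alarm is raised only when p_exceed > delta.
    """
    dist = sum_distribution(site_pdfs)
    p_exceed = sum(p for s, p in dist.items() if s > gamma)
    return p_exceed > delta, p_exceed


if __name__ == "__main__":
    # Two hypothetical sites, each reporting one of two scores with equal probability.
    sites = [{1: 0.5, 3: 0.5}, {2: 0.5, 4: 0.5}]
    print(threshold_violated(sites, gamma=4, delta=0.7))
    print(threshold_violated(sites, gamma=4, delta=0.8))
```

With deterministic data only `gamma` would matter. Here the same data and the same `gamma` can trigger or suppress the alarm depending on `delta`, which is why both thresholds have to be tracked together.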

[1] Dan Suciu, et al. Efficient query evaluation on probabilistic databases, 2004, The VLDB Journal.

[2] S. Heber, et al. Statistical Approaches to Identifying Androgen Response Elements, 2007.

[3] Graham Cormode, et al. Algorithms for distributed functional monitoring, 2008, SODA '08.

[4] Kevin Chen-Chuan Chang, et al. Probabilistic top-k and ranking-aggregate queries, 2008, TODS.

[5] Wei Hong, et al. Model-Driven Data Acquisition in Sensor Networks, 2004, VLDB.

[6] Andrew McGregor, et al. Estimating statistical aggregates on probabilistic data streams, 2008, TODS.

[7] Le Gruenwald, et al. Using Data Mining to Estimate Missing Sensor Data, 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[8] Ling Huang, et al. Communication-Efficient Tracking of Distributed Cumulative Triggers, 2007, 27th International Conference on Distributed Computing Systems (ICDCS '07).

[9] Russ Bubley, et al. Randomized algorithms, 1995, CSUR.

[10] Honguk Woo, et al. Real-Time Monitoring of Uncertain Data Streams Using Probabilistic Similarity, 2007, RTSS 2007.

[11] Samuel Madden, et al. Using Probabilistic Models for Data Management in Acquisitional Environments, 2005, CIDR.

[12] Feifei Li, et al. Ranking distributed probabilistic data, 2009, SIGMOD Conference.

[13] Jennifer Widom, et al. Adaptive filters for continuous queries over distributed data streams, 2003, SIGMOD '03.

[14] Alon Y. Halevy, et al. Data integration with uncertainty, 2007, The VLDB Journal.

[15] Jeffrey Scott Vitter, et al. Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data, 2004, VLDB.

[16] Graham Cormode, et al. Communication-efficient distributed monitoring of thresholded counts, 2006, SIGMOD Conference.

[17] Pushpraj Shukla, et al. Efficient Constraint Monitoring Using Adaptive Thresholds, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[18] Amol Deshpande, et al. Online Filtering, Smoothing and Probabilistic Modeling of Streaming data, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[19] Peter J. Haas, et al. MCDB: a monte carlo approach to managing uncertain data, 2008, SIGMOD Conference.

[20] Sunil Prabhakar, et al. Threshold query optimization for uncertain data, 2010, SIGMOD Conference.

[21] Jian Pei, et al. Continuously monitoring top-k uncertain data streams: a probabilistic threshold method, 2009, Distributed and Parallel Databases.

[22] Vladimir Vapnik, Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities, 1971.

[23] Andrew McGregor, et al. Conditioning and aggregating uncertain data streams, 2010, Proc. VLDB Endow.

[24] T. S. Jayram, et al. Efficient aggregation algorithms for probabilistic data, 2007, SODA '07.

[25] Robert B. Ross, et al. Aggregate operators in probabilistic databases, 2005, JACM.

[26] Jennifer Widom, et al. Making Aggregation Work in Uncertain and Probabilistic Databases, 2011, IEEE Transactions on Knowledge and Data Engineering.

[27] Honguk Woo, et al. Real-Time Monitoring of Uncertain Data Streams Using Probabilistic Similarity, 2007, 28th IEEE International Real-Time Systems Symposium (RTSS 2007).

[28] Subramanian Arumugam, et al. Evaluation of probabilistic threshold queries in MCDB, 2010, SIGMOD Conference.

[29] Shuang Wang, et al. Distributed Frequent Items Detection on Uncertain Data, 2010, ADMA.

[30] David J. Grabiner, et al. Monte Carlo query processing of uncertain multidimensional array data, 2011, 2011 IEEE 27th International Conference on Data Engineering.

[31] Jennifer Widom, et al. Representing uncertain data: models, properties, and algorithms, 2009, The VLDB Journal.

[32] J. Norris. Appendix: probability and measure, 1997.

[33] Jeffrey Xu Yu, et al. Sliding-window top-k queries on uncertain streams, 2008, The VLDB Journal.

[34] Sunil Prabhakar, et al. Evaluating probabilistic queries over imprecise data, 2003, SIGMOD '03.

[35] Assaf Schuster, et al. A Geometric Approach to Monitoring Threshold Functions over Distributed Data Streams, 2010, Ubiquitous Knowledge Discovery.