Efficient Query Evaluation over Temporally Correlated Probabilistic Streams

Many real-world applications, such as sensor networks and other monitoring applications, naturally generate probabilistic streams that are highly correlated in both time and space. Query processing over such streaming data must be cognizant of these correlations, since they can significantly alter the final query results. Several prior works have suggested approaches to handling correlations in probabilistic databases. However, those approaches either cannot represent the types of correlations that probabilistic streams exhibit, or cannot be applied directly because of their complexity. In this paper, we develop a system for managing and querying such streams by exploiting the fact that most real-world probabilistic streams exhibit highly structured Markovian correlations. Our approach is based on the previously proposed framework of viewing probabilistic query evaluation as inference over graphical models; we show how to efficiently construct graphical models for the common stream processing operators, and how to efficiently perform inference over them in an incremental fashion. We also present an algorithm for operator ordering that judiciously rearranges the query operators to make query evaluation tractable whenever the query permits. Our extensive experimental evaluation illustrates the advantages of exploiting the structured nature of correlations in probabilistic streams.
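
To illustrate why temporal correlations can alter query results, the following minimal sketch compares two ways of answering the same query over a stream of uncertain tuples: "is the tuple present at every one of the first n time steps?" It is not the system described in the paper. The function names, the two-state presence model, and the probabilities are all assumptions chosen for illustration. One evaluation treats time steps as independent and multiplies marginals; the other respects a first-order Markov dependence and updates its answer incrementally, one time step at a time.

```python
def presence_marginals(p_first, p_stay, p_appear, n):
    """Yield P(X_t = present) for t = 1..n under a two-state Markov chain.

    p_first  : P(X_1 = present)                      (illustrative parameter)
    p_stay   : P(X_{t+1} = present | X_t = present)  (illustrative parameter)
    p_appear : P(X_{t+1} = present | X_t = absent)   (illustrative parameter)
    """
    p = p_first
    for _ in range(n):
        yield p
        # Propagate the marginal one step forward through the transition model.
        p = p * p_stay + (1.0 - p) * p_appear


def prob_always_present_independent(p_first, p_stay, p_appear, n):
    """Answer the query while ignoring correlations: multiply the marginals."""
    result = 1.0
    for p in presence_marginals(p_first, p_stay, p_appear, n):
        result *= p
    return result


def prob_always_present_markov(p_first, p_stay, n):
    """Answer the query while respecting the Markov dependence.

    P(X_1 = 1, ..., X_n = 1) = P(X_1 = 1) * p_stay^(n - 1).
    The running product is updated once per arriving time step, so the
    answer is maintained incrementally rather than recomputed.
    """
    result = p_first
    for _ in range(n - 1):
        result *= p_stay
    return result


if __name__ == "__main__":
    # Assumed values: every marginal is 0.5, but presence is strongly persistent.
    independent = prob_always_present_independent(0.5, 0.9, 0.1, 3)
    correlated = prob_always_present_markov(0.5, 0.9, 3)
    print(f"ignoring correlations: {independent:.3f}")
    print(f"with Markov correlations: {correlated:.3f}")
```

With these assumed parameters every per-step marginal equals 0.5, so the independent evaluation gives about 0.125, while the Markov-aware evaluation gives about 0.405. The marginals are identical in both cases; only the treatment of the correlation differs.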
