BayesStore: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models

Several real-world applications need to effectively manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of reasons, including motion prediction and human behavior modeling. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex spatio-temporal correlation patterns present in uncertain sensory data. Unfortunately, most existing approaches to probabilistic database systems have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: probabilistic information is typically associated with individual data tuples, with limited or no support for effectively capturing and reasoning about complex data correlations. In this paper, we introduce BayesStore, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. Adopting a machine-learning view, BayesStore employs concise statistical relational models to effectively encode the correlation patterns between uncertain data, and promotes probabilistic inference and statistical model manipulation as part of the standard DBMS operator repertoire to support efficient and sound query processing. We present BayesStore's uncertainty model, which is based on a novel first-order statistical model, and we redefine traditional query processing operators to manipulate both the data and the probabilistic models of the database efficiently. Finally, we validate our approach by demonstrating the value of exploiting data correlations during query processing and by evaluating a number of optimizations that significantly accelerate query processing.
