Lineage processing over correlated probabilistic databases

In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach in which we first compute the lineages of the output tuples and then compute the probabilities of the lineage formulas. However, unlike previous work, we allow arbitrary and complex correlations to be present in the data, captured via a forest of junction trees. We observe that evaluating even read-once (tree-structured) lineages (e.g., those generated by hierarchical conjunctive queries), which can be computed in polynomial time over tuple-independent probabilistic databases, is #P-complete for lightly correlated probabilistic databases such as Markov sequences. We characterize the complexity of exactly computing the probability of a lineage formula on a correlated database using a parameter called lwidth (analogous to the notion of treewidth). For lineages that result in low lwidth, we compute exact probabilities using a novel message passing algorithm, and for lineages that induce large lwidths, we develop approximate Monte Carlo algorithms to estimate the result probabilities. We scale our algorithms to very large correlated probabilistic databases using the previously proposed INDSEP data structure. To mitigate the complexity of lineage evaluation, we develop optimization techniques that process a batch of lineages by sharing computation across formulas and that exploit any independence relationships that may exist in the data. Our experimental study illustrates the benefits of using our algorithms for processing lineage formulas over correlated probabilistic databases.
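To make the tuple-independent baseline mentioned above concrete, the following minimal sketch computes the probability of a read-once lineage in a single bottom-up pass and compares it with a naive Monte Carlo estimate. It is an illustration only and is not the message passing or sampling algorithm of the paper: it assumes all tuple variables are mutually independent, which is exactly the assumption the paper drops. The formula encoding and the names `read_once_probability`, `evaluate`, and `monte_carlo_probability` are hypothetical and chosen for this example.

```python
import random

# Hypothetical encoding of a lineage formula as nested tuples:
#   ("var", name), ("and", [children]), ("or", [children]).
# Read-once means every variable occurs at most once in the formula.

def read_once_probability(formula, prob):
    """Exact probability of a read-once formula, ASSUMING independent variables."""
    kind = formula[0]
    if kind == "var":
        return prob[formula[1]]
    child_probs = [read_once_probability(child, prob) for child in formula[1]]
    if kind == "and":
        # Children share no variables, so their probabilities multiply.
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if kind == "or":
        # P(at least one child true) = 1 - P(all children false).
        all_false = 1.0
        for p in child_probs:
            all_false *= 1.0 - p
        return 1.0 - all_false
    raise ValueError(f"unknown node kind: {kind}")

def evaluate(formula, world):
    """Truth value of the formula in one possible world (dict: name -> bool)."""
    kind = formula[0]
    if kind == "var":
        return world[formula[1]]
    values = [evaluate(child, world) for child in formula[1]]
    return all(values) if kind == "and" else any(values)

def monte_carlo_probability(formula, prob, samples=20000, seed=0):
    """Naive estimate: sample independent worlds and count satisfying ones."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = {name: rng.random() < p for name, p in prob.items()}
        hits += evaluate(formula, world)
    return hits / samples

# Example lineage (x1 AND y1) OR (x2 AND y2) with made-up tuple probabilities.
lineage = ("or", [
    ("and", [("var", "x1"), ("var", "y1")]),
    ("and", [("var", "x2"), ("var", "y2")]),
])
prob = {"x1": 0.5, "y1": 0.4, "x2": 0.2, "y2": 0.5}

exact = read_once_probability(lineage, prob)
estimate = monte_carlo_probability(lineage, prob)
print(exact, estimate)
```

Under independence the exact pass is linear in the size of the formula. Once the variables are correlated, for instance along a Markov sequence, the product rules in the `and` and `or` branches no longer hold, which is why the abstract treats that case separately.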
