Representing and Querying Correlated Tuples in Probabilistic Databases

Probabilistic databases have received considerable attention recently because many real-world applications produce uncertain data that must be stored. Their widespread use is hampered by two limitations: (1) current probabilistic databases make simplistic assumptions about the data (e.g., complete independence among tuples), which makes them difficult to use in applications that naturally produce correlated data, and (2) most probabilistic databases can answer only a restricted subset of the queries that can be expressed in traditional query languages. We address both limitations by proposing a framework that can represent not only probabilistic tuples but also the correlations that may be present among them. The framework lends itself naturally to possible-world semantics, thus preserving the precise query semantics of current probabilistic databases. We develop an efficient strategy for query evaluation over such databases by casting query processing as an inference problem in an appropriately constructed probabilistic graphical model, and we present several optimizations specific to probabilistic databases that enable efficient query evaluation. We validate our approach with an experimental evaluation that illustrates the effectiveness of our techniques at answering various queries on real and synthetic datasets.
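To make the role of tuple correlations concrete, the following is a minimal sketch of possible-world semantics over two uncertain tuples. It is not the evaluation strategy of the paper: it enumerates every possible world by brute force, whereas the abstract describes inference in a graphical model. The function name `query_probability`, the factor encoding (a scope of tuple indices plus a table over their presence values), and all probabilities are assumptions chosen for illustration.

```python
from itertools import product


def query_probability(factors, num_tuples, query):
    """Sum the probability of every possible world in which `query` holds.

    factors: list of (scope, table) pairs (hypothetical encoding). `scope` is a
             tuple of tuple indices; `table` maps presence assignments for
             those indices to a non-negative weight.
    query:   predicate over a world, where a world is a tuple of booleans
             saying which database tuples are present.
    """
    total = 0.0
    for world in product([False, True], repeat=num_tuples):
        # The probability of a world is the product of all factor entries
        # selected by that world (factors here are already normalized).
        weight = 1.0
        for scope, table in factors:
            weight *= table[tuple(world[i] for i in scope)]
        if query(world):
            total += weight
    return total


# Independent tuples: t0 present with probability 0.6, t1 with 0.4.
independent = [
    ((0,), {(False,): 0.4, (True,): 0.6}),
    ((1,), {(False,): 0.6, (True,): 0.4}),
]

# Same marginals, but the tuples are mutually exclusive:
# exactly one of them is present in every world with nonzero probability.
exclusive = [
    ((0, 1), {
        (False, False): 0.0,
        (True, False): 0.6,
        (False, True): 0.4,
        (True, True): 0.0,
    }),
]


def either_present(world):
    """Boolean query: is at least one of the two tuples present?"""
    return world[0] or world[1]


p_independent = query_probability(independent, 2, either_present)
p_exclusive = query_probability(exclusive, 2, either_present)
print(p_independent, p_exclusive)
```

Both databases assign the same marginal probability to each tuple, yet the query is satisfied with probability about 0.76 under independence and 1.0 under mutual exclusion. An independence assumption would therefore give a wrong answer on the correlated data, which is the first limitation the abstract names.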
