Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases

Queries over probabilistic databases are either safe, in which case they can be evaluated entirely inside a relational database engine, or unsafe, in which case they must be evaluated with a general-purpose inference engine at high cost. We propose a new approach in which every query is evaluated inside the database engine, using a method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs: graph reliability, which is #P-hard, and the propagation score, which is related to PageRank and is in PTIME. When restricted to graphs, the standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for self-join-free conjunctive queries and prove that it is always an upper bound on query reliability and that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a highly efficient way to rank probabilistic query results, especially for queries that are highly intractable for exact probabilistic inference.
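The relationship between the two graph scores can be illustrated on a toy example. The sketch below is not taken from the paper: the function names, the five-edge graph, and the edge probabilities are all hypothetical, and the propagation recurrence used here, score(y) = 1 - prod over incoming edges (x, y) of (1 - score(x) * p(x, y)), is an assumed formulation for directed acyclic graphs. Reliability is computed by brute-force enumeration of all possible worlds, which is only feasible for very small graphs.

```python
from itertools import product


def reliability(edges, source, target):
    """Exact source-target reliability by enumerating every possible world.

    `edges` maps (u, v) to the independent probability that the edge exists.
    Runs in time exponential in the number of edges.
    """
    items = list(edges.items())
    total = 0.0
    for world in product([False, True], repeat=len(items)):
        weight = 1.0
        present = []
        for ((u, v), p), keep in zip(items, world):
            weight *= p if keep else 1.0 - p
            if keep:
                present.append((u, v))
        # Depth-first search over the edges present in this world.
        seen, stack = {source}, [source]
        while stack:
            node = stack.pop()
            for u, v in present:
                if u == node and v not in seen:
                    seen.add(v)
                    stack.append(v)
        if target in seen:
            total += weight
    return total


def propagation(edges, source, order):
    """Propagation score on a DAG (assumed recurrence, see the text above).

    `order` must be a topological order of the nodes; each incoming edge is
    treated as if it carried independent evidence.
    """
    score = {source: 1.0}
    for node in order:
        if node == source:
            continue
        miss = 1.0
        for (u, v), p in edges.items():
            if v == node:
                miss *= 1.0 - score.get(u, 0.0) * p
        score[node] = 1.0 - miss
    return score


# Hypothetical graph: both paths from s to t share the edge (s, a).
shared = {("s", "a"): 0.5, ("a", "b"): 0.5, ("a", "c"): 0.5,
          ("b", "t"): 0.5, ("c", "t"): 0.5}
rel = reliability(shared, "s", "t")
prop = propagation(shared, "s", ["s", "a", "b", "c", "t"])["t"]
print(rel, prop)
```

On this graph the propagation score counts the shared edge (s, a) once per path, so it overestimates the reliability. On a graph with a single path the two values are equal, which mirrors the claim that the scores coincide in the safe case.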
