Dissociation and propagation for approximate lifted inference with standard relational database management systems

Probabilistic inference over large data sets is a challenging data management problem: exact inference is generally #P-hard, and today it is most often approximated with sampling-based methods. This paper proposes an alternative approach for the approximate evaluation of conjunctive queries with standard relational databases: every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each of which provides an upper bound on the true probability, and then taking their minimum. We provide an algorithm that takes important schema information into account to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm strictly generalizes all known results on PTIME self-join-free conjunctive queries: a query is in PTIME if and only if our algorithm returns a single plan. Furthermore, our approach generalizes a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans efficiently. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. Finally, the techniques developed in this paper apply immediately to lifted inference from statistical relational models, since lifted inference corresponds to PTIME plans in probabilistic databases.
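
The idea of taking the minimum over several upper-bounding plans can be illustrated with a small sketch. The code below is not the paper's implementation (which runs the plans as queries inside the database engine); it is a plain Python illustration under stated assumptions: a tuple-independent database, the Boolean query q :- R(x), S(x,y), T(y), and two hand-written plans that differ in which variable is projected away first. All function names and the example data are hypothetical. A brute-force enumeration of possible worlds is included only to check the bounds on tiny inputs.

```python
from itertools import product


def independent_or(probs):
    """Probability that at least one of several independent events holds."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none


def plan_dissociate_T(R, S, T):
    """Plan that projects y away first, then x.

    T(y) is treated as an independent copy for every x it joins with
    (a dissociation of T on x), which gives an upper bound.
    """
    per_x = []
    for x, pr in R.items():
        inner = independent_or(
            ps * T.get(y, 0.0) for (x2, y), ps in S.items() if x2 == x
        )
        per_x.append(pr * inner)
    return independent_or(per_x)


def plan_dissociate_R(R, S, T):
    """Symmetric plan that projects x away first, then y (dissociates R on y)."""
    per_y = []
    for y, pt in T.items():
        inner = independent_or(
            ps * R.get(x, 0.0) for (x, y2), ps in S.items() if y2 == y
        )
        per_y.append(pt * inner)
    return independent_or(per_y)


def approximate_probability(R, S, T):
    """Each plan is an upper bound, so their minimum is the tightest bound."""
    return min(plan_dissociate_T(R, S, T), plan_dissociate_R(R, S, T))


def exact_probability(R, S, T):
    """Brute force over all possible worlds; exponential, for checking only."""
    tuples = [("R", k) for k in R] + [("S", k) for k in S] + [("T", k) for k in T]
    probs = [R[k] for k in R] + [S[k] for k in S] + [T[k] for k in T]
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        present = {t for t, keep in zip(tuples, world) if keep}
        weight = 1.0
        for p, keep in zip(probs, world):
            weight *= p if keep else 1.0 - p
        satisfied = any(
            ("S", (x, y)) in present and ("R", x) in present and ("T", y) in present
            for (x, y) in S
        )
        if satisfied:
            total += weight
    return total


# Hypothetical example data: tuple -> marginal probability.
R = {1: 0.5, 2: 0.5}
T = {1: 0.5, 2: 0.5}
S = {(1, 1): 0.5, (1, 2): 0.5, (2, 1): 0.5, (2, 2): 0.5}

bound = approximate_probability(R, S, T)
exact = exact_probability(R, S, T)
print(f"exact = {exact:.4f}, min over plans = {bound:.4f}")
```

When R contains a single x value, no tuple of T is copied in the first plan, so that plan coincides with the exact probability on such an instance; in general both plans overestimate and the minimum is reported.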
