Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS

Markov Logic Networks (MLNs) have emerged as a powerful framework that combines statistical and logical reasoning; they have been applied to many data-intensive problems, including information extraction, entity resolution, and text mining. Current implementations of MLNs do not scale to large real-world data sets, which prevents their widespread adoption. We present Tuffy, which achieves scalability via three novel contributions: (1) a bottom-up approach to grounding that allows us to leverage the full power of the relational optimizer, (2) a novel hybrid architecture that allows us to perform AI-style local search efficiently using an RDBMS, and (3) a theoretical insight that shows when one can (exponentially) improve the efficiency of stochastic local search. We leverage (3) to build novel partitioning, loading, and parallel algorithms. We show that our approach outperforms state-of-the-art implementations in both quality and speed on several publicly available datasets.
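
To make contribution (1) concrete, the following is a minimal sketch of what bottom-up grounding can look like when a clause is expressed as a relational join, so that the database optimizer chooses the join order. It is an illustration only, not Tuffy's actual schema or code: the example clause Friends(x, y) AND Smokes(x) => Smokes(y), the table names `smokes_atom` and `friends`, and the function `ground_clause` are all hypothetical, and SQLite stands in for a full RDBMS.

```python
import sqlite3


def ground_clause(people, friends):
    """Ground the (hypothetical) clause Friends(x,y) AND Smokes(x) => Smokes(y).

    Friends is treated as fully observed evidence; Smokes atoms are unknown.
    Returns one ground clause per evidence tuple, encoded as a pair of signed
    atom ids (-id_x, +id_y), i.e. NOT Smokes(x) OR Smokes(y).
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # One row per ground atom Smokes(person); atom_id is assigned by SQLite.
    cur.execute(
        "CREATE TABLE smokes_atom (atom_id INTEGER PRIMARY KEY, person TEXT UNIQUE)"
    )
    # Evidence relation Friends(x, y).
    cur.execute("CREATE TABLE friends (x TEXT, y TEXT)")
    cur.executemany(
        "INSERT INTO smokes_atom (person) VALUES (?)", [(p,) for p in people]
    )
    cur.executemany("INSERT INTO friends VALUES (?, ?)", friends)
    # The grounding itself is a single join; the optimizer picks the plan.
    rows = cur.execute(
        """
        SELECT sx.atom_id, sy.atom_id
        FROM friends AS f
        JOIN smokes_atom AS sx ON sx.person = f.x
        JOIN smokes_atom AS sy ON sy.person = f.y
        ORDER BY sx.atom_id, sy.atom_id
        """
    ).fetchall()
    conn.close()
    return [(-x_id, y_id) for x_id, y_id in rows]


if __name__ == "__main__":
    clauses = ground_clause(
        ["Anna", "Bob", "Carl"],
        [("Anna", "Bob"), ("Bob", "Carl")],
    )
    print(clauses)
```

In this sketch only the bindings that actually appear in the evidence table produce ground clauses, which is the sense in which the grounding is driven by the data rather than by enumerating every possible variable assignment.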
