ScaLeKB: scalable learning and inference over large knowledge bases

Recent years have seen a drastic rise in the construction of web knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge, web corpora, and information extraction algorithms, the knowledge bases are still far from complete. To infer the missing knowledge, we propose the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from these web knowledge bases. The OP algorithm scales up via a series of optimization techniques, including a new parallel rule mining algorithm, a pruning strategy to eliminate unsound and inefficient rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 h; no existing system achieves this scale. Based on the mining algorithm and the optimizations, we develop an efficient inference engine, with which we infer 0.9 billion new facts from Freebase in 17.19 h. We use cross validation to evaluate the inferred facts and estimate a degree of expansion of 0.6 over Freebase, with a precision approaching 1.0. Our approach outperforms state-of-the-art mining algorithms and inference engines in terms of both performance and quality.
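To make the idea of a first-order inference rule concrete, the following is a minimal Python sketch of applying one length-2 Horn rule, head(x, y) <- body1(x, z), body2(z, y), to a toy set of triples and scoring it. It is an illustration only and not the OP algorithm or its parallel implementation: the function names, the toy predicates (`bornIn`, `cityOf`, `nationality`), and the confidence measure (the fraction of inferred facts already present in the knowledge base) are all assumptions made for this example.

```python
from collections import defaultdict


def apply_rule(facts, body1, body2, head):
    """Apply head(x, y) <- body1(x, z), body2(z, y) and return the inferred triples.

    `facts` is a set of (predicate, subject, object) triples. All names here
    are hypothetical; this is a plain hash join, not the paper's engine.
    """
    # Index the body2 facts by subject so the join on z is a dictionary lookup.
    by_subject = defaultdict(set)
    for pred, subj, obj in facts:
        if pred == body2:
            by_subject[subj].add(obj)

    inferred = set()
    for pred, x, z in facts:
        if pred == body1:
            for y in by_subject.get(z, ()):
                inferred.add((head, x, y))
    return inferred


def rule_confidence(facts, inferred):
    """Fraction of inferred facts already in the knowledge base (assumed metric)."""
    if not inferred:
        return 0.0
    return len(inferred & facts) / len(inferred)


# Toy knowledge base (invented data, for illustration only).
facts = {
    ("bornIn", "alice", "paris"),
    ("bornIn", "bob", "lyon"),
    ("cityOf", "paris", "france"),
    ("cityOf", "lyon", "france"),
    ("nationality", "alice", "france"),
}

inferred = apply_rule(facts, "bornIn", "cityOf", "nationality")
confidence = rule_confidence(facts, inferred)
new_facts = inferred - facts  # candidate facts that would expand the knowledge base
```

In this sketch, a pruning step in the spirit of the abstract could discard any candidate rule whose confidence falls below a chosen threshold before its inferred facts are added to the knowledge base; the threshold itself would be a tuning choice not specified here.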
