Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases

Recent years have seen rapid growth in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, and other entities. However, because of the limitations of human knowledge and of information extraction algorithms, they remain far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP), which scales to web-scale knowledge bases through a series of parallelization and optimization techniques: a relational knowledge base model that applies inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm that breaks the mining task into smaller independent sub-tasks, and a pruning strategy that eliminates unsound and resource-consuming rules before they are applied. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.
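To make the idea of applying a first-order rule through a join concrete, the following is a minimal sketch, not the OP implementation. It assumes facts are stored as (predicate, subject, object) triples and that a rule has the hypothetical form head(x, y) <- body1(x, z), body2(z, y). The function names, the toy facts, and the confidence measure (the fraction of inferred facts already present in the knowledge base) are all illustrative assumptions rather than details taken from the paper.

```python
from collections import defaultdict


def apply_rule(facts, head, body1, body2):
    """Infer head(x, y) from body1(x, z) and body2(z, y) using a hash join on z."""
    # Index body2 facts by subject so each body1 fact can look up its matches.
    by_subject = defaultdict(set)
    for pred, subj, obj in facts:
        if pred == body2:
            by_subject[subj].add(obj)

    inferred = set()
    for pred, x, z in facts:
        if pred == body1:
            for y in by_subject.get(z, ()):
                inferred.add((head, x, y))
    return inferred


def rule_confidence(facts, head, body1, body2):
    """Assumed metric: share of inferred facts that the knowledge base already contains."""
    inferred = apply_rule(facts, head, body1, body2)
    if not inferred:
        return 0.0
    return len(inferred & facts) / len(inferred)


# Toy knowledge base (hypothetical facts, for illustration only).
facts = {
    ("bornIn", "ada", "london"),
    ("cityOf", "london", "uk"),
    ("bornIn", "bob", "paris"),
    ("cityOf", "paris", "france"),
    ("nationality", "ada", "uk"),
}

# Candidate rule: nationality(x, y) <- bornIn(x, z), cityOf(z, y)
inferred = apply_rule(facts, "nationality", "bornIn", "cityOf")
confidence = rule_confidence(facts, "nationality", "bornIn", "cityOf")
```

In this toy setting the rule infers two nationality facts, one of which is already in the knowledge base. A low-scoring rule could be discarded before it is used for expansion, which is the general intuition behind pruning, although the paper's actual pruning criteria are not given in the abstract.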
