OPIEC: An Open Information Extraction Corpus

Open information extraction (OIE) systems extract relations and their arguments from natural language text in an unsupervised manner. The resulting extractions are a valuable resource for downstream tasks such as knowledge base construction, open question answering, and event schema induction. In this paper, we release, describe, and analyze OPIEC, an OIE corpus extracted from the text of English Wikipedia. OPIEC complements the available OIE resources: it is the largest OIE corpus publicly available to date (over 340M triples) and contains valuable metadata such as provenance information, confidence scores, linguistic annotations, and semantic annotations including spatial and temporal information. We analyze the OPIEC corpus by comparing its content with knowledge bases such as DBpedia and YAGO, which are also based on Wikipedia. We find that most of the facts between entities present in OPIEC cannot be found in DBpedia and/or YAGO, that OIE facts often differ from knowledge base facts in their level of specificity, and that the open relations in OIE extractions are generally highly polysemous. We believe that the OPIEC corpus is a valuable resource for future research on automated knowledge base construction.
