Distantly supervised Web relation extraction for knowledge base population

Extracting information from Web pages to populate large, cross-domain knowledge bases requires methods that are suitable across domains, need no manual effort to adapt to new domains, can deal with noise, and can integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information, with promising results; one of these approaches is distant supervision. Distant supervision is an unsupervised method which uses background information from the Linking Open Data cloud to automatically label sentences with relations, creating training data for relation classifiers. In this paper we propose the use of distant supervision for relation extraction from the Web. Although the method is promising, existing approaches are still not suitable for Web extraction, as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and by extracting relations across sentence boundaries using unsupervised co-reference resolution methods. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. To combine information extracted from multiple sources for populating knowledge bases, we present and evaluate several information integration strategies and show that these benefit immensely from the additional relation mentions extracted using co-reference resolution, which increase precision by 8%. We further show that strategically selecting training data can increase precision by a further 3%.
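The labelling step of distant supervision described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the toy knowledge base, the example sentences, the plain substring matching used in place of entity recognition, and the function names `label_sentences` and `drop_ambiguous` are all assumptions made for illustration. The second function shows one simple way lexical ambiguity could be filtered out, by discarding examples whose entity pair is linked by more than one relation.

```python
from collections import defaultdict

# Toy knowledge base of (subject, relation, object) triples.
# All values are illustrative assumptions, not data from the paper.
KB = [
    ("The Beatles", "album", "Let It Be"),
    ("The Beatles", "track", "Let It Be"),
    ("The Beatles", "member", "Paul McCartney"),
]

SENTENCES = [
    "Let It Be is the twelfth album by The Beatles.",
    "Paul McCartney played bass for The Beatles.",
    "The Beatles formed in Liverpool.",
]


def label_sentences(kb, sentences):
    """Label every sentence that mentions both arguments of a KB triple.

    Returns a list of (sentence, subject, object, relations) tuples, where
    relations is the sorted list of all KB relations holding for the pair.
    Substring matching stands in for a real entity recogniser.
    """
    pair_relations = defaultdict(set)
    for subj, rel, obj in kb:
        pair_relations[(subj, obj)].add(rel)

    labelled = []
    for sentence in sentences:
        for (subj, obj), rels in pair_relations.items():
            if subj in sentence and obj in sentence:
                labelled.append((sentence, subj, obj, sorted(rels)))
    return labelled


def drop_ambiguous(labelled):
    """Keep only examples whose entity pair has exactly one relation."""
    return [example for example in labelled if len(example[3]) == 1]


labelled = label_sentences(KB, SENTENCES)
training_data = drop_ambiguous(labelled)
```

In this toy setting the first sentence is labelled with both `album` and `track`, because the knowledge base links the same pair of names through two relations, so the filter removes it and only the `member` example remains as training data. The third sentence mentions only one argument and is never labelled.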
