N-ary relation extraction for simultaneous T-Box and A-Box knowledge base augmentation

The Web has evolved into a huge mine of knowledge carved in different forms, the predominant one still being the free-text document. This motivates the need for Intelligent Web-reading Agents: hypothetically, they would skim through corpora drawn from disparate Web sources and generate meaningful structured assertions to fuel Knowledge Bases (KBs). Ultimately, comprehensive KBs, like Wikidata and DBpedia, play a fundamental role in coping with information overload. In light of this vision, this paper presents the Fact Extractor, a complete Natural Language Processing (NLP) pipeline which reads an input textual corpus and produces machine-readable statements. Each statement is supplied with a confidence score and undergoes a disambiguation step via Entity Linking, thus allowing the assignment of KB-compliant URIs. The system implements four research contributions: (1) it executes N-ary relation extraction by applying the Frame Semantics linguistic theory, as opposed to binary techniques; (2) it simultaneously populates both the T-Box and the A-Box of the target KB; (3) it relies on a single NLP layer, namely part-of-speech tagging; (4) it enables a completely supervised yet reasonably priced machine learning environment through a crowdsourcing strategy. We assess our approach by setting the target KB to DBpedia and by considering a use case of 52,000 Italian Wikipedia soccer player articles. From those, we obtain a dataset of more than 213,000 triples with an estimated 81.27% F1. We corroborate the evaluation via (i) a performance comparison with a baseline system, as well as (ii) an analysis of the T-Box and A-Box augmentation capabilities. The outcomes are incorporated into the Italian DBpedia chapter and can be queried through its SPARQL endpoint or downloaded as standalone data dumps. The codebase is released as free software and is publicly available in the DBpedia Association repository.
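To make the contrast between N-ary and binary extraction concrete, the following is a minimal sketch of one common way to flatten an N-ary frame into triples: an intermediate node stands for the frame instance, and each participant is attached to that node through its own role property. This is an illustration under stated assumptions, not the Fact Extractor's actual data model: the `Frame` class, the `frame_to_triples` function, the `http://example.org/` namespace, and the sample frame, roles, and score are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A triple is (subject, predicate, object), all plain strings here.
Triple = Tuple[str, str, str]


@dataclass
class Frame:
    """Hypothetical container for one extracted N-ary relation."""
    frame_id: str          # local identifier of this frame instance
    frame_type: str        # name of the frame, used as the linking property
    subject: str           # URI of the entity the sentence is about
    roles: Dict[str, str]  # role name -> linked entity URI or literal
    confidence: float      # score attached to the whole statement


def frame_to_triples(frame: Frame, ns: str = "http://example.org/") -> List[Triple]:
    """Flatten an N-ary frame into triples via an intermediate node."""
    node = ns + frame.frame_id
    triples: List[Triple] = [
        # Link the subject entity to the frame instance.
        (frame.subject, ns + frame.frame_type, node),
        # Keep the confidence score on the frame instance.
        (node, ns + "confidence", str(frame.confidence)),
    ]
    # One triple per participant; sorted for a deterministic order.
    for role, value in sorted(frame.roles.items()):
        triples.append((node, ns + role, value))
    return triples


# Hypothetical example: a player joining a team in a given year.
example = Frame(
    frame_id="frame_001",
    frame_type="Activity",
    subject="http://example.org/Some_Player",
    roles={"team": "http://example.org/Some_Team", "time": "2004"},
    confidence=0.8,
)

for triple in frame_to_triples(example):
    print(triple)
```

In this sketch, a binary extractor would emit at most one player-team pair, whereas the intermediate node keeps the team and the time tied to the same event. The role properties would correspond to T-Box additions and the triples that use them to A-Box assertions, again only as an assumption about how such output could be organized.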
