From RDF to Natural Language and Back

Most knowledge sources on the Data Web were extracted from structured or semistructured data sources. Thus, they cover only a small fraction of the information available on the document-oriented Web. In this chapter, we present Bootstrapping Linked Data (BOA), a framework that aims to facilitate the extraction of Resource Description Framework (RDF) data from text. The idea behind BOA is to use background knowledge from the Data Web to extract, from unstructured data, natural language patterns that represent predicates found on the Data Web. These patterns are then used to extract instance knowledge from unstructured data sources. This knowledge can finally be fed back into the Data Web. BOA's approach is largely independent of the language in which the corpus is written. We demonstrate our approach by applying it to four different corpora in two different languages. We evaluate BOA on these data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with high accuracy. Moreover, we provide the first multilingual repository of natural language representations (NLRs) of predicates found on the Data Web. Finally, we present two applications of the natural language patterns generated by BOA: the fact validation framework DeFacto and the question answering engine Template-based SPARQL Learner (TBSL).
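The bootstrapping idea described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the BOA implementation: the facts, the corpus, the predicate name birthPlace, the helper names learn_patterns and extract_facts, and the capitalized-word entity heuristic are all hypothetical and chosen only for illustration. Known facts are used to find the text between a subject and an object label, and that text is then reused as a pattern to propose new facts.

```python
import re
from collections import Counter

# Hypothetical background knowledge: (subject label, predicate, object label).
# In a real setting these would come from a knowledge base such as DBpedia.
KNOWN_FACTS = [
    ("Albert Einstein", "birthPlace", "Ulm"),
    ("Marie Curie", "birthPlace", "Warsaw"),
]

# Hypothetical unstructured corpus.
CORPUS = [
    "Albert Einstein was born in Ulm in 1879.",
    "Marie Curie was born in Warsaw.",
    "Alan Turing was born in London.",
]

# Crude stand-in for entity recognition: a run of capitalized words.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"


def learn_patterns(facts, sentences):
    """Count the text found between known subject and object labels."""
    patterns = Counter()
    for subj, pred, obj in facts:
        regex = re.compile(re.escape(subj) + r"\s+(.+?)\s+" + re.escape(obj))
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                patterns[(pred, match.group(1))] += 1
    return patterns


def extract_facts(patterns, sentences, known):
    """Apply learned patterns and keep only triples not already known."""
    known_set = set(known)
    new_facts = []
    for (pred, middle) in patterns:
        regex = re.compile(
            "(" + ENTITY + r")\s+" + re.escape(middle) + r"\s+(" + ENTITY + ")"
        )
        for sentence in sentences:
            for match in regex.finditer(sentence):
                triple = (match.group(1), pred, match.group(2))
                if triple not in known_set and triple not in new_facts:
                    new_facts.append(triple)
    return new_facts


if __name__ == "__main__":
    learned = learn_patterns(KNOWN_FACTS, CORPUS)
    print(learned)
    print(extract_facts(learned, CORPUS, KNOWN_FACTS))
```

In this sketch the pattern counts are only a rough proxy for pattern quality; a practical system would need proper entity recognition and a scoring step before feeding extracted triples back into a knowledge base.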
