RealText-lex: A Lexicalization Framework for RDF Triples

Abstract The online era has made vast amounts of information available in public and semi-restricted domains, prompting the development of a corresponding host of technologies to organize and navigate it. One such technology encodes information from free-form natural language into a structured form as RDF triples. This representation enables machine processing of the data; however, the processed information cannot be directly converted back into human language. This has created a need to lexicalize machine-processed data stored as triples into natural language, so that there is a seamless transition between the machine representation of information and information meant for human consumption. This paper presents a framework for lexicalizing RDF triples extracted from DBpedia, a central interlinking hub of the emerging Web of Data. The framework comprises four pattern mining modules that generate lexicalization patterns to transform triples into natural language sentences. Three of these modules are based on lexicons, while the fourth extracts relations from unstructured text to generate lexicalization patterns. A linguistic accuracy evaluation and a human evaluation on a sub-sample showed that the framework produces patterns that are accurate and exhibit qualities of human-generated text.
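To make the idea of a lexicalization pattern concrete, the following is a minimal sketch of how a mined pattern could render a DBpedia-style triple as a sentence. The pattern table, the "?s"/"?o" placeholder convention, and the example triple are all illustrative assumptions for this sketch, not the paper's actual pattern format or mining modules.

```python
# A minimal illustration of triple lexicalization; not the framework's implementation.
# The pattern strings and the example triple below are hypothetical.

from typing import Dict, Tuple

# An RDF triple: (subject, predicate, object), here with DBpedia-style prefixes.
Triple = Tuple[str, str, str]

# A toy pattern table mapping a predicate to a sentence template with
# ?s and ?o placeholders (assumed format, for illustration only).
PATTERNS: Dict[str, str] = {
    "dbo:country": "?s is a city in ?o.",
    "dbo:author": "?s was written by ?o.",
}

def lexicalize(triple: Triple, patterns: Dict[str, str]) -> str:
    """Render a triple as a natural language sentence using a lexicalization pattern."""
    subj, pred, obj = triple
    template = patterns[pred]

    def label(uri: str) -> str:
        # Strip the namespace prefix and underscores to get a readable surface form.
        return uri.split(":")[-1].replace("_", " ")

    return template.replace("?s", label(subj)).replace("?o", label(obj))

if __name__ == "__main__":
    print(lexicalize(("dbr:Berlin", "dbo:country", "dbr:Germany"), PATTERNS))
    # -> Berlin is a city in Germany.
```

In the framework described above, such templates would not be hand-written but produced by the four pattern mining modules (three lexicon-based, one relation-extraction-based over unstructured text).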
