Automatic Population of Structured Knowledge Bases via Natural Language Processing

The Web has evolved into a huge mine of knowledge carved in different forms, the predominant one still being the free-text document. This motivates the need for Intelligent Web-reading Agents: hypothetically, they would skim through corpora drawn from disparate Web sources and generate meaningful structured assertions to fuel Knowledge Bases (KBs). Ultimately, comprehensive KBs, like Wikidata and DBpedia, play a fundamental role in coping with information overload. In light of this vision, this thesis describes a set of systems based on Natural Language Processing (NLP), which take unstructured or semi-structured information sources as input and produce machine-readable statements for a target KB. We make four main research contributions: (1) a one-step methodology for crowdsourcing Frame Semantics annotation; (2) an NLP technique that implements the above contribution to perform N-ary Relation Extraction from Wikipedia, thus enriching the target KB with properties; (3) a taxonomy learning strategy that produces an intuitive and exhaustive class hierarchy from the Wikipedia category graph, thus augmenting the target KB with classes; (4) a recommender system that leverages a KB network to yield atypical suggestions with detailed explanations, serving as a proof of work for real-world end users. The outcomes are incorporated into the Italian DBpedia chapter and can be queried through its public endpoint and/or downloaded as standalone data dumps.
