A Large-Scale Multilingual Disambiguation of Glosses

Linking concepts and named entities to knowledge bases has become a crucial Natural Language Understanding task. In this respect, recent work has shown the key advantage of exploiting textual definitions in various Natural Language Processing applications. However, to date no reliable large-scale corpora of sense-annotated textual definitions are available to the research community. In this paper we present a large-scale, high-quality corpus of disambiguated glosses in multiple languages, comprising sense annotations of both concepts and named entities from a unified sense inventory. Our approach to constructing and disambiguating the corpus builds upon the structure of a large multilingual semantic network and a state-of-the-art disambiguation system: first, we gather complementary information from equivalent definitions across different languages to provide context for disambiguation, and then we combine it with a semantic similarity-based refinement. As a result, we obtain a multilingual corpus of textual definitions featuring over 38 million definitions in 263 languages, and we make it freely available online. Experiments on Open Information Extraction and Sense Clustering show how two state-of-the-art approaches improve their performance by integrating our disambiguated corpus into their pipeline.
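The similarity-based refinement step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine`, `centroid`, `refine_sense`), the toy two-dimensional vectors, the sense identifiers, and the threshold value are all hypothetical, and the abstract does not specify which vector representation or similarity measure is actually used. The sketch assumes each candidate sense has a vector, builds a context vector by averaging the vectors of the other senses in the definition, and keeps the most similar candidate only if it passes a threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refine_sense(context_vectors, candidate_senses, threshold=0.5):
    """Pick the candidate sense closest to the context, or None if none is close enough.

    context_vectors: vectors of senses already assigned in the definition (assumed input).
    candidate_senses: dict mapping a hypothetical sense id to its vector.
    threshold: hypothetical cut-off below which no annotation is made.
    """
    if not context_vectors or not candidate_senses:
        return None
    ctx = centroid(context_vectors)
    scored = ((sense_id, cosine(ctx, vec)) for sense_id, vec in candidate_senses.items())
    best_id, best_score = max(scored, key=lambda pair: pair[1])
    return best_id if best_score >= threshold else None


# Toy usage with made-up vectors: the context leans toward the first dimension.
context = [[1.0, 0.0], [0.8, 0.2]]
candidates = {"bank_river": [0.0, 1.0], "bank_finance": [1.0, 0.1]}
chosen = refine_sense(context, candidates)
```

Returning `None` when no candidate clears the threshold reflects one possible design choice, namely preferring to leave a mention unannotated over emitting a low-confidence sense; whether the described system behaves this way is an assumption.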
