SenseDefs: a multilingual corpus of semantically annotated textual definitions

Definitional knowledge has proved essential in various Natural Language Processing tasks and applications, especially when information at the level of word senses is exploited. However, the few sense-annotated corpora of textual definitions available to date are of limited size, mainly because annotating a wide variety of word senses and entity mentions at a reasonably large scale is expensive and time-consuming. In this paper we present SenseDefs, a large-scale, high-quality corpus of disambiguated definitions (or glosses) in multiple languages, comprising sense annotations of both concepts and named entities from a wide-coverage unified sense inventory. Our approach to constructing and disambiguating this corpus builds upon the structure of a large multilingual semantic network and a state-of-the-art disambiguation system: first, we gather complementary information from equivalent definitions across different languages to provide context for disambiguation; then we refine the disambiguation output with a distributional approach based on semantic similarity. As a result, we obtain a multilingual corpus of textual definitions featuring over 38 million definitions in 263 languages, and we publicly release it to the research community. We assess the quality of SenseDefs’s sense annotations both intrinsically and extrinsically on Open Information Extraction and Sense Clustering tasks.
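The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`gather_context`, `refine`), the input layout, the toy vectors, and the 0.5 threshold are all assumptions made for illustration, and the disambiguation system that would run between the two steps is left out.

```python
import math
from collections import defaultdict


def gather_context(definitions):
    """Step 1 (sketch): pool the definitions of the same concept across languages.

    `definitions` is an iterable of (concept_id, language, text) triples;
    this layout is a hypothetical one chosen for the example.
    """
    by_concept = defaultdict(list)
    for concept_id, _language, text in definitions:
        by_concept[concept_id].append(text)
    # The pooled text would serve as context for a disambiguation system.
    return {cid: " ".join(texts) for cid, texts in by_concept.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def refine(annotations, target_vector, sense_vectors, threshold=0.5):
    """Step 2 (sketch): keep annotations semantically close to the defined concept.

    `annotations` is a list of (mention, sense_id) pairs. The threshold value
    is an assumed placeholder, not a figure taken from the paper.
    """
    kept = []
    for mention, sense_id in annotations:
        vector = sense_vectors.get(sense_id)
        if vector is not None and cosine(vector, target_vector) >= threshold:
            kept.append((mention, sense_id))
    return kept


if __name__ == "__main__":
    defs = [
        ("c1", "en", "a domesticated feline"),
        ("c1", "es", "felino doméstico"),
    ]
    print(gather_context(defs))
    vectors = {"s_feline": [0.9, 0.1], "s_bank": [0.0, 1.0]}
    print(refine([("feline", "s_feline"), ("bank", "s_bank")], [1.0, 0.0], vectors))
```

In this toy setup an annotation survives refinement only if its sense vector points in roughly the same direction as the vector of the concept being defined; a real system would draw both from a shared sense embedding space.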
