EuroSense: Automatic Harvesting of Multilingual Sense Annotations from Parallel Text

Parallel corpora are widely used in a variety of Natural Language Processing tasks, from Machine Translation to cross-lingual Word Sense Disambiguation, where parallel sentences can be exploited to automatically generate high-quality sense annotations on a large scale. In this paper we present EuroSense, a multilingual sense-annotated resource based on the joint disambiguation of the Europarl parallel corpus, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities from a language-independent unified sense inventory. We evaluate the quality of our sense annotations intrinsically and extrinsically, showing their effectiveness as training data for Word Sense Disambiguation.
