The MuCoW Test Suite at WMT 2019: Automatically Harvested Multilingual Contrastive Word Sense Disambiguation Test Sets for Machine Translation

Supervised Neural Machine Translation (NMT) systems currently achieve impressive translation quality for many language pairs. One of the key features of a correct translation is the ability to perform word sense disambiguation (WSD), i.e., to translate an ambiguous word with its correct sense. Existing evaluation benchmarks for the WSD capabilities of translation systems rely heavily on manual work and cover only a few language pairs and a few word types. We present MUCOW, a multilingual contrastive test suite that covers 16 language pairs with more than 200 000 contrastive sentence pairs, automatically built from word-aligned parallel corpora and the wide-coverage multilingual sense inventory of BabelNet. We evaluate the quality of the ambiguity lexicons and of the resulting test suite on all submissions from 9 language pairs presented in the WMT19 news shared translation task, plus 5 other language pairs using pretrained NMT models. The MUCOW test suite is available at http://github.com/Helsinki-NLP/MuCoW.
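
The abstract does not spell out how a contrastive sentence pair is scored. As a minimal sketch, assuming the common contrastive-evaluation setup in which a system counts as correct when it scores the reference translation higher than a variant carrying the wrong sense, accuracy could be computed as below. The `ContrastiveExample` type, the `toy_score` function, and the example sentences are hypothetical illustrations and are not taken from the MUCOW data or code.

```python
from typing import Callable, Iterable, NamedTuple


class ContrastiveExample(NamedTuple):
    """One contrastive pair (hypothetical layout, not the MuCoW file format)."""
    source: str       # source sentence containing an ambiguous word
    correct: str      # translation using the right sense
    contrastive: str  # same translation with the wrong sense substituted


def contrastive_accuracy(
    examples: Iterable[ContrastiveExample],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the correct translation outscores the contrastive one."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for ex in examples
        if score(ex.source, ex.correct) > score(ex.source, ex.contrastive)
    )
    return hits / len(examples)


def toy_score(source: str, target: str) -> float:
    """Stand-in for a model score (assumption): always prefers 'Feder', ignoring context."""
    return 1.0 if "Feder" in target else 0.0


EXAMPLES = [
    ContrastiveExample(
        source="He fixed the spring in the clock.",
        correct="Er reparierte die Feder in der Uhr.",
        contrastive="Er reparierte den Frühling in der Uhr.",
    ),
    ContrastiveExample(
        source="Spring begins in March.",
        correct="Der Frühling beginnt im März.",
        contrastive="Die Feder beginnt im März.",
    ),
]

if __name__ == "__main__":
    print(contrastive_accuracy(EXAMPLES, toy_score))
```

In practice `score` would be replaced by a real system's score for the target given the source, for instance a length-normalized log-probability. The toy scorer ignores the source sentence, so it gets one of the two pairs right, which illustrates why a context-blind system cannot do well on such pairs.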
