CUNI Systems for the Unsupervised News Translation Task in WMT 2019

In this paper we describe the CUNI translation system used for the unsupervised news shared task of the ACL 2019 Fourth Conference on Machine Translation (WMT19). We follow the strategy of Artexte et al. (2018b), creating a seed phrase-based system where the phrase table is initialized from cross-lingual embedding mappings trained on monolingual data, followed by a neural machine translation system trained on synthetic parallel data. The synthetic corpus was produced from a monolingual corpus by a tuned PBMT model refined through iterative back-translation. We further focus on the handling of named entities, i.e. the part of vocabulary where the cross-lingual embedding mapping suffers most. Our system reaches a BLEU score of 15.3 on the German-Czech WMT19 shared task.

[1]  Guillaume Lample,et al.  Word Translation Without Parallel Data , 2017, ICLR.

[2]  Noah A. Smith,et al.  A Simple, Fast, and Effective Reparameterization of IBM Model 2 , 2013, NAACL.

[3]  Meng Zhang,et al.  Adversarial Training for Unsupervised Bilingual Lexicon Induction , 2017, ACL.

[4]  Ondrej Bojar,et al.  English-Czech Systems in WMT19: Document-Level Transformer , 2019, WMT.

[5]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[6]  Hermann Ney,et al.  CharacTer: Translation Edit Rate on Character Level , 2016, WMT.

[7]  Jan Hajic,et al.  Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition , 2014, ACL.

[8]  Jörg Tiedemann,et al.  Parallel Data, Tools and Interfaces in OPUS , 2012, LREC.

[9]  Kenneth Heafield,et al.  KenLM: Faster and Smaller Language Model Queries , 2011, WMT@EMNLP.

[10]  Khalil Sima'an,et al.  Fitting Sentence Level Translation Evaluation with Many Dense Features , 2014, EMNLP.

[11]  Timothy Baldwin,et al.  langid.py: An Off-the-shelf Language Identification Tool , 2012, ACL.

[12]  Rico Sennrich,et al.  Improving Neural Machine Translation Models with Monolingual Data , 2015, ACL.

[13]  Ondrej Bojar,et al.  Trivial Transfer Learning for Low-Resource Neural Machine Translation , 2018, WMT.

[14]  Guillaume Lample,et al.  Phrase-Based & Neural Unsupervised Machine Translation , 2018, EMNLP.

[15]  Eneko Agirre,et al.  A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings , 2018, ACL.

[16]  Eneko Agirre,et al.  Unsupervised Neural Machine Translation , 2017, ICLR.

[17]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[18]  Dominik Macháček Enriching Neural MT through Multi-Task Training , 2018 .

[19]  Eneko Agirre,et al.  Unsupervised Statistical Machine Translation , 2018, EMNLP.

[20]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[21]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.