Beyond English-Centric Multilingual Machine Translation

Existing work in translation has demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric, trained only on data that was translated from or into English. While such data is abundant, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open-source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. We then explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT. We open-source our scripts so that others can reproduce the data, the evaluation, and the final M2M-100 model.
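
The mined training data rests on a margin-based scoring criterion over multilingual sentence embeddings: a candidate pair is kept when its cosine similarity stands out against each sentence's similarity to its other nearest neighbors. The following is a minimal NumPy sketch of that ratio-margin score, not the paper's production pipeline; the unit-normalized embeddings, the function name, and the brute-force neighbor search are assumptions for illustration (mining at the scale described requires an approximate k-NN index such as FAISS).

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores between unit-normalized sentence embeddings.

    score(x, y) = cos(x, y) / mean(avg cos of x's k-NN, avg cos of y's k-NN).
    Candidate pairs scoring above a tuned threshold are kept as bitext.
    """
    # Cosine similarity of every source sentence with every target sentence.
    sim = src_emb @ tgt_emb.T                              # (n_src, n_tgt)
    # Average similarity to the k nearest neighbors, in both directions.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # (n_src,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # (n_tgt,)
    # Normalizing by the neighborhood penalizes "hub" sentences that are
    # close to everything, rather than close to one true translation.
    margin = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
    return sim / margin

# Toy usage: random unit vectors stand in for real multilingual embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(10, 16)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
best_tgt = margin_scores(src, tgt, k=4).argmax(axis=1)     # best match per source
```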

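Because the final model is released, a direct non-English translation can be sanity-checked in a few lines. The sketch below goes through the Hugging Face port of the released checkpoints rather than the original fairseq scripts; the checkpoint name and the Hindi-to-French direction are assumptions for illustration.

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# A Hugging Face port of the released weights (checkpoint name assumed).
name = "facebook/m2m100_418M"
model = M2M100ForConditionalGeneration.from_pretrained(name)
tokenizer = M2M100Tokenizer.from_pretrained(name)

# Hindi -> French, with no English pivot in between.
tokenizer.src_lang = "hi"
# "Life is like a box of chocolates."
encoded = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(
        **encoded,
        # Force the decoder to start with the French language token.
        forced_bos_token_id=tokenizer.get_lang_id("fr"),
    )
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Decoded output can then be scored against references with a standardized BLEU implementation such as sacreBLEU, which keeps results comparable with the reported numbers.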