Beyond English-Centric Multilingual Machine Translation

Existing work in translation has demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric, trained only on data that was translated from or into English. While such data is abundant, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open-source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. We then explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT. We open-source our scripts so that others can reproduce the data, the evaluation, and the final M2M-100 model.
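
The mined training data rests on a margin-based scoring criterion over multilingual sentence embeddings: a candidate pair is kept when its cosine similarity stands out against each sentence's similarity to its other nearest neighbors. The following is a minimal NumPy sketch of that ratio-margin score, not the paper's production pipeline; the unit-normalized embeddings, the function name, and the brute-force neighbor search are assumptions for illustration (mining at the scale described requires an approximate k-NN index such as FAISS).

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores between unit-normalized sentence embeddings.

    score(x, y) = cos(x, y) / mean(avg cos of x's k-NN, avg cos of y's k-NN).
    Candidate pairs scoring above a tuned threshold are kept as bitext.
    """
    # Cosine similarity of every source sentence with every target sentence.
    sim = src_emb @ tgt_emb.T                              # (n_src, n_tgt)
    # Average similarity to the k nearest neighbors, in both directions.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # (n_src,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # (n_tgt,)
    # Normalizing by the neighborhood penalizes "hub" sentences that are
    # close to everything, rather than close to one true translation.
    margin = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
    return sim / margin

# Toy usage: random unit vectors stand in for real multilingual embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(10, 16)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
best_tgt = margin_scores(src, tgt, k=4).argmax(axis=1)     # best match per source
```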

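Because the final model is released, a direct non-English translation can be sanity-checked in a few lines. The sketch below goes through the Hugging Face port of the released checkpoints rather than the original fairseq scripts; the checkpoint name and the Hindi-to-French direction are assumptions for illustration.

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# A Hugging Face port of the released weights (checkpoint name assumed).
name = "facebook/m2m100_418M"
model = M2M100ForConditionalGeneration.from_pretrained(name)
tokenizer = M2M100Tokenizer.from_pretrained(name)

# Hindi -> French, with no English pivot in between.
tokenizer.src_lang = "hi"
# "Life is like a box of chocolates."
encoded = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(
        **encoded,
        # Force the decoder to start with the French language token.
        forced_bos_token_id=tokenizer.get_lang_id("fr"),
    )
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Decoded output can then be scored against references with a standardized BLEU implementation such as sacreBLEU, which keeps results comparable with the reported numbers.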