Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical study across 10 languages to find out, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. Beyond yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust to domain differences, its translations for unseen and typologically distant languages remain below 3.0 BLEU. In answer to our title's question, mBART is not a low-resource panacea; we therefore encourage shifting the emphasis from new models to new data.
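To make the reported scores concrete, the following is a minimal pure-Python sketch of the BLEU metric the abstract cites (smoothed sentence-level n-gram precision with a brevity penalty); it is an illustration only, not the paper's actual SacreBLEU evaluation setup:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU (0-100) with add-one smoothing on each n-gram precision."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = sum(hyp_ng.values())
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A score "below 3.0 BLEU", as reported for unseen and typologically distant languages, means the hypotheses share almost no n-grams with the references beyond chance.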
