A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

Recent advances in pre-training language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to build the datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a novel African news corpus covering 16 languages, of which eight are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring to both additional languages and additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
