ANVITA-African: A Multilingual Neural Machine Translation System for African Languages

This paper describes the ANVITA-African NMT system submitted by team ANVITA to the WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages, under the constrained translation track. The team participated in 24 African-language-to-English MT directions. To better handle relatively low-resource language pairs and enable effective transfer learning, models are trained in a multilingual setting. Heuristic-based corpus filtering improved performance by 0.04-2.06 BLEU on 22 of the 24 African-to-English directions and sped up training by 5x. Using a deep transformer with a 24-layer encoder and a 6-layer decoder significantly improved performance, by 1.1-7.7 BLEU, over the base transformer across all 24 African-to-English directions. For effective selection of the source vocabulary in the multilingual setting, joint and language-wise vocabulary selection strategies are explored on the source side. Language-wise vocabulary selection, however, did not consistently outperform joint vocabulary selection on the low-resource languages. Empirical results indicate that training a deep transformer on the filtered corpora is a better choice than training a base transformer on the full corpora, in terms of both accuracy and training time.
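The abstract does not enumerate the filtering heuristics, so the following is only a minimal sketch of the kind of rule-based parallel-corpus filters typically used for this purpose (length bounds, source/target length ratio, exact-duplicate removal). The thresholds and the `keep_pair`/`filter_corpus` helpers are illustrative assumptions, not the authors' actual rules.

```python
# Minimal sketch of heuristic parallel-corpus filtering (illustrative only;
# the paper's exact heuristics and thresholds are not given in the abstract).

def keep_pair(src: str, tgt: str,
              min_len: int = 1, max_len: int = 250,
              max_ratio: float = 3.0) -> bool:
    """Apply simple length and length-ratio heuristics to one sentence pair."""
    src_toks, tgt_toks = src.split(), tgt.split()
    if not (min_len <= len(src_toks) <= max_len):
        return False
    if not (min_len <= len(tgt_toks) <= max_len):
        return False
    # Discard pairs whose source and target lengths are wildly mismatched.
    ratio = max(len(src_toks), len(tgt_toks)) / max(min(len(src_toks), len(tgt_toks)), 1)
    return ratio <= max_ratio

def filter_corpus(pairs):
    """Drop exact duplicates and pairs failing the heuristics; yield survivors."""
    seen = set()
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen or not keep_pair(*key):
            continue
        seen.add(key)
        yield key
```

Cheap rule-based filters like these discard noisy pairs before training, which is consistent with the reported 5x training speedup on the surviving data.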
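The 24-layer-encoder/6-layer-decoder shape can be expressed directly in PyTorch. The sketch below fixes only the layer counts reported in the abstract; the model dimension, head count, feed-forward size, and the pre-norm choice are assumptions the abstract does not confirm.

```python
import torch.nn as nn

# Sketch of the deep-encoder/shallow-decoder shape from the abstract
# (24 encoder layers, 6 decoder layers). All other hyperparameters are
# assumptions: the abstract does not state them.
model = nn.Transformer(
    d_model=512,            # assumed; transformer-base uses 512
    nhead=8,                # assumed; transformer-base uses 8 heads
    num_encoder_layers=24,  # from the abstract
    num_decoder_layers=6,   # from the abstract
    dim_feedforward=2048,   # assumed; transformer-base uses 2048
    norm_first=True,        # pre-norm commonly stabilizes deep encoders (assumed)
)
```

Pairing a deep encoder with a shallow decoder is a common way to add capacity while keeping decoding fast, which fits the reported gains over the base transformer.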
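The two source-vocabulary strategies can be contrasted with a subword tokenizer such as SentencePiece: one joint model trained on the concatenation of all source languages versus one model per source language. In this sketch the language codes, file names, and vocabulary sizes are illustrative assumptions, not the paper's settings.

```python
import sentencepiece as spm

LANGS = ["hau", "ibo", "swh"]  # illustrative subset of the 24 source languages

# Joint vocabulary: one subword model over all source languages together.
spm.SentencePieceTrainer.train(
    input=",".join(f"train.{lang}" for lang in LANGS),  # assumed file naming
    model_prefix="src_joint",
    vocab_size=32000,  # assumed; the abstract gives no vocabulary sizes
)

# Language-wise vocabulary: a separate subword model per source language,
# whose vocabularies would then be merged into one source vocabulary.
for lang in LANGS:
    spm.SentencePieceTrainer.train(
        input=f"train.{lang}",
        model_prefix=f"src_{lang}",
        vocab_size=8000,  # assumed per-language budget
    )
```

Per the abstract, the language-wise strategy did not consistently outperform the joint one on low-resource languages, so the extra bookkeeping of per-language models may not pay off.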
