Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages

Research in machine translation (MT) is developing at a rapid pace. However, most work in the community has focused on languages for which large amounts of digital resources are available. In this study, we benchmark state-of-the-art statistical machine translation (SMT) and neural machine translation (NMT) systems on two low-resource African languages: Somali and Swahili. These languages are socially important and serve as test-beds for developing technologies that perform reasonably well despite the low-resource constraint. Our findings suggest that SMT and NMT can perform similarly in low-resource scenarios, but neural systems require more careful tuning to reach comparable performance. We also investigate how to exploit additional data, such as bilingual text harvested from the web and user dictionaries, and find that NMT performance improves significantly when such data are used. Finally, we survey the landscape of machine translation resources for the languages of Africa and suggest promising directions for future research.

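The paper reports benchmark results rather than implementation details, but MT benchmarking of this kind is typically scored with corpus-level BLEU. Below is a minimal, illustrative sketch of such an evaluation using the sacrebleu library; the file names (`hyp.sw`, `ref.sw`) are hypothetical placeholders, not artifacts from the paper.

```python
# Illustrative sketch: corpus-level BLEU scoring of MT system output,
# as commonly done when benchmarking SMT/NMT systems.
# File names below are hypothetical, one sentence per line.
import sacrebleu


def evaluate(hyp_path: str, ref_path: str) -> float:
    """Compute corpus BLEU for one system output against one reference set."""
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]
    # sacrebleu expects a list of reference streams (one list per reference set).
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score


if __name__ == "__main__":
    print(f"BLEU: {evaluate('hyp.sw', 'ref.sw'):.2f}")
```

Using a standard tool such as sacrebleu, rather than ad hoc BLEU implementations, keeps scores comparable across systems and papers, which matters especially when comparing SMT and NMT baselines in low-resource settings.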