Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that incorporating UMT techniques into SSNMT significantly outperforms SSNMT alone (up to +4.3 BLEU, af2en) as well as statistical (+50.8 BLEU) and hybrid UMT (+51.5 BLEU) baselines on related, distantly related and unrelated language pairs.
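As an illustration of the noising mentioned above, a minimal sketch of the two noise types commonly used in unsupervised NMT (word dropout and local word shuffling) might look as follows; the function name, parameters and defaults here are illustrative assumptions, not the authors' implementation:

```python
import random

def noise(sentence, drop_prob=0.1, k=3, seed=0):
    """Illustrative noising for unsupervised NMT training data:
    word dropout plus a local shuffle in which each word moves
    at most k positions away from its original index."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    words = sentence.split()
    # Word dropout: remove each word with probability drop_prob
    # (fall back to the original words if everything was dropped).
    kept = [w for w in words if rng.random() > drop_prob] or words
    # Local shuffle: sort by original index plus uniform noise in [0, k],
    # so each word drifts at most k positions from where it started.
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return " ".join(w for _, w in sorted(zip(keys, kept), key=lambda t: t[0]))
```

A denoising model is then trained to reconstruct the clean sentence from such noised input, which is what makes the synthetic data useful despite containing no real translations.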
