Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources

We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low-resource languages, exploring the case where both parallel training data and compute resources are lacking, which reflects the reality of most of the world's languages and of the researchers working on them. We propose a simple and scalable method to improve unsupervised NMT, showing that adding comparable data mined using a bilingual dictionary, along with modest additional compute resources to train the model, can significantly improve its performance. We also demonstrate that using the dictionary to code-switch monolingual data, creating more comparable data, can further improve performance. With this weak supervision, our best method achieves BLEU scores that improve over supervised results for English→Gujarati (+18.88), English→Kazakh (+5.84), and English→Somali (+1.16), showing the promise of weakly-supervised NMT for the world's many low-resource languages when only modest compute resources are available. To the best of our knowledge, our work is the first to quantitatively showcase the impact of different modest compute resources in low-resource NMT.
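
The two dictionary-based ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`code_switch`, `overlap_score`), the `switch_prob` parameter, the whitespace tokenization, and the toy English-Spanish dictionary are all assumptions made for illustration. `code_switch` replaces words of a monolingual sentence with dictionary translations to produce a pseudo-comparable sentence, and `overlap_score` is one simple way a dictionary could be used to score candidate comparable sentence pairs.

```python
import random


def code_switch(sentence, dictionary, switch_prob=0.5, seed=0):
    """Replace tokens with dictionary translations (hypothetical helper).

    Each whitespace token that has an entry in `dictionary` is swapped for
    one of its translations with probability `switch_prob`.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    switched = []
    for token in sentence.split():
        translations = dictionary.get(token.lower())
        if translations and rng.random() < switch_prob:
            switched.append(rng.choice(translations))
        else:
            switched.append(token)
    return " ".join(switched)


def overlap_score(source, target, dictionary):
    """Fraction of source tokens with a dictionary translation in `target`.

    An assumed scoring rule for mining comparable pairs, not the paper's.
    """
    source_tokens = source.lower().split()
    target_tokens = set(target.lower().split())
    if not source_tokens:
        return 0.0
    hits = sum(
        1
        for token in source_tokens
        if any(t in target_tokens for t in dictionary.get(token, []))
    )
    return hits / len(source_tokens)


# Toy dictionary, invented for this example only.
toy_dictionary = {"the": ["el"], "cat": ["gato"], "sleeps": ["duerme"]}

fully_switched = code_switch("the cat sleeps", toy_dictionary, switch_prob=1.0)
unchanged = code_switch("the cat sleeps", toy_dictionary, switch_prob=0.0)
pair_score = overlap_score("the cat sleeps", "el gato come", toy_dictionary)
```

In a pipeline of this kind, sentence pairs whose score exceeds some threshold would be kept as comparable data, and code-switched sentences would be paired with their originals; both the threshold and the pairing scheme are design choices the abstract does not specify.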
