Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language using only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation, and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on seven languages from three different language families and show that our technique significantly improves translation into the low-resource language compared to other translation baselines.
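The abstract names three monolingual training objectives but gives no implementation details. The snippet below is a minimal sketch, assuming a generic PyTorch encoder-decoder with hypothetical `model.loss`, `model.encode`, `noise_fn`, and `back_translate_fn` interfaces and equal loss weighting; it only illustrates how denoising autoencoding, back-translation, and an adversarial language discriminator could be combined in one training step, and is not the paper's actual implementation.

```python
# Hypothetical sketch: combining the three monolingual objectives named in the
# abstract. All interfaces (model.loss, model.encode, noise_fn, back_translate_fn)
# and the equal loss weighting are illustrative assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, a common way to realize an adversarial objective."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def adaptation_step(model, discriminator, lrl_batch, hrl_batch, noise_fn, back_translate_fn):
    """One hypothetical adaptation step using only monolingual batches.

    model:          encoder-decoder NMT model (assumed .encode() / .loss() interface)
    discriminator:  classifier predicting the language of pooled encoder states
    lrl_batch:      monolingual batch in the low-resource language
    hrl_batch:      monolingual batch in the related high-resource language
    """
    # 1) Denoising autoencoding: reconstruct the clean sentence from a noised copy.
    dae_loss = model.loss(src=noise_fn(lrl_batch), tgt=lrl_batch)

    # 2) Back-translation: translate LRL -> HRL with the current model (no gradients),
    #    then train on the synthetic pair in the reverse direction.
    with torch.no_grad():
        synthetic_hrl = back_translate_fn(model, lrl_batch)
    bt_loss = model.loss(src=synthetic_hrl, tgt=lrl_batch)

    # 3) Adversarial objective: push LRL and HRL encoder representations toward a
    #    shared space by fooling a language discriminator via gradient reversal.
    lrl_states = GradReverse.apply(model.encode(lrl_batch))
    hrl_states = GradReverse.apply(model.encode(hrl_batch))
    logits = discriminator(torch.cat([lrl_states, hrl_states], dim=0))
    labels = torch.cat([torch.zeros(len(lrl_batch), dtype=torch.long),
                        torch.ones(len(hrl_batch), dtype=torch.long)])
    adv_loss = nn.functional.cross_entropy(logits, labels)

    # Equal weighting assumed for illustration only.
    return dae_loss + bt_loss + adv_loss
```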
