Unsupervised Neural Machine Translation with Generative Language Models Only

We show how to derive state-of-the-art unsupervised neural machine translation systems from generatively pre-trained language models. Our method consists of three steps: few-shot amplification, distillation, and backtranslation. We first use the zero-shot translation ability of large pre-trained language models to generate translations for a small set of unlabeled sentences. We then amplify these zero-shot translations by using them as few-shot demonstrations for sampling a larger synthetic dataset. This dataset is distilled by discarding the few-shot demonstrations and then fine-tuning. During backtranslation, we repeatedly generate translations for a set of inputs and then fine-tune a single language model on both directions of the translation task at once, ensuring cycle-consistency by swapping the roles of gold monotext and generated translations when fine-tuning. By using our method to leverage GPT-3’s zero-shot translation capability, we achieve a new state-of-the-art in unsupervised translation on the WMT14 English-French benchmark, attaining a BLEU score of 42.1.
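The three-step procedure can be summarized in the following minimal sketch. It is an illustrative outline only: the helpers `translate`, `few_shot_translate`, and `finetune` are hypothetical placeholders for calls to a large generative language model, and the seed size, sample count, and round count are assumed values, not the settings used in the paper.

```python
# Sketch of the three steps: few-shot amplification, distillation, and
# backtranslation with role-swapping. All helpers and constants here are
# illustrative assumptions, not an official implementation.

import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (input text, target text)


def derive_unsupervised_mt(
    model,                               # pre-trained generative LM (opaque handle)
    mono_src: List[str],                 # unlabeled source-language sentences
    mono_tgt: List[str],                 # unlabeled target-language sentences
    translate: Callable[[object, str, str], str],               # (model, sentence, direction) -> translation
    few_shot_translate: Callable[[object, List[Pair], str, str], str],
    finetune: Callable[[object, List[Pair]], object],           # fine-tune on (input, target) pairs
    backtranslation_rounds: int = 3,
):
    # Step 1: few-shot amplification.
    # Zero-shot translate a small seed set, then reuse those translations as
    # in-context demonstrations to sample a larger synthetic parallel dataset.
    seed = random.sample(mono_src, k=min(8, len(mono_src)))
    demos = [(s, translate(model, s, "src->tgt")) for s in seed]
    pool = random.sample(mono_src, k=min(1024, len(mono_src)))
    synthetic = [(s, few_shot_translate(model, demos, s, "src->tgt")) for s in pool]

    # Step 2: distillation.
    # Discard the demonstrations and fine-tune the model directly on the
    # synthetic (source, translation) pairs.
    model = finetune(model, synthetic)

    # Step 3: iterative backtranslation.
    # Translate in both directions with the current model, then fine-tune a
    # single model on both directions at once. Cycle-consistency comes from
    # swapping roles: the generated translation is the input, and the gold
    # monotext is always the training target.
    for _ in range(backtranslation_rounds):
        fwd = [(s, translate(model, s, "src->tgt")) for s in mono_src]
        bwd = [(t, translate(model, t, "tgt->src")) for t in mono_tgt]
        pairs = [(gen, gold) for gold, gen in fwd] + [(gen, gold) for gold, gen in bwd]
        model = finetune(model, pairs)

    return model
```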
