MASS: Masked Sequence to Sequence Pre-training for Language Generation

Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from a rich-resource pre-training task to low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation tasks. MASS adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes as input a sentence with a randomly masked fragment (several consecutive tokens), and its decoder tries to predict this masked fragment. In this way, MASS jointly trains the encoder and decoder to develop the capabilities of representation extraction and language modeling. After further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization, and conversational response generation (3 tasks and 8 datasets in total), MASS achieves significant improvements over baselines without pre-training or with other pre-training methods. In particular, we achieve state-of-the-art accuracy (37.5 in terms of BLEU score) on unsupervised English-French translation, even beating an early attention-based supervised model.
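
As a concrete illustration of the masking scheme described above, the following minimal Python sketch builds one MASS-style training example: the encoder sees the sentence with a contiguous fragment replaced by mask tokens, and the decoder is trained to reconstruct that fragment. This is not the authors' code; the `mass_example` helper, the `[MASK]` placeholder string, the 50% default mask ratio, and the simple right-shifted decoder input are illustrative assumptions (in the paper, the decoder additionally masks the tokens that were visible to the encoder).

```python
import random

MASK = "[MASK]"  # placeholder mask symbol; the actual vocabulary entry is an assumption


def mass_example(tokens, mask_ratio=0.5, seed=None):
    """Build one MASS-style training example from a tokenized sentence.

    Returns (encoder_input, decoder_input, decoder_target):
      - encoder_input: the sentence with a contiguous fragment replaced by [MASK]
      - decoder_input: the fragment shifted right by one position (teacher forcing)
      - decoder_target: the original masked fragment the decoder must predict
    """
    rng = random.Random(seed)
    m = len(tokens)
    k = max(1, int(m * mask_ratio))      # fragment length (~50% of the sentence here)
    u = rng.randint(0, m - k)            # random start position of the fragment
    fragment = tokens[u:u + k]

    encoder_input = tokens[:u] + [MASK] * k + tokens[u + k:]
    decoder_input = [MASK] + fragment[:-1]   # previous target tokens, right-shifted
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target


if __name__ == "__main__":
    sent = "we propose masked sequence to sequence pre-training".split()
    enc_in, dec_in, dec_out = mass_example(sent, seed=0)
    print("encoder input :", enc_in)
    print("decoder input :", dec_in)
    print("decoder target:", dec_out)
```

Because the target is a span of consecutive tokens rather than isolated positions, the decoder must rely both on the encoder's representation of the unmasked context and on its own previously generated tokens, which is what jointly trains representation extraction and language modeling.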
