Encoder-Agnostic Adaptation for Conditional Language Generation

Large pretrained language models have changed the way researchers approach discriminative natural language understanding tasks, leading to the dominance of approaches that adapt a pretrained model for arbitrary downstream tasks. However, it is an open question how to use similar techniques for language generation. Early results in the encoder-agnostic setting have been mostly negative. In this work we explore methods for adapting a pretrained language model to arbitrary conditional input. We observe that pretrained transformer models are sensitive to large parameter changes during tuning. We therefore propose an adaptation that directly injects arbitrary conditioning into self attention, an approach we call pseudo self attention. Through experiments on four diverse conditional text generation tasks, we show that this encoder-agnostic technique outperforms strong baselines, produces coherent generations, and is data efficient.
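One natural reading of "injecting arbitrary conditioning into self attention" is to project the conditioning vectors into the key/value space of the pretrained decoder's self-attention, so that each decoder position attends jointly over the condition and its own history. The sketch below illustrates that reading as a single-head PyTorch module; the names (PseudoSelfAttention, u_k, u_v, d_model) are illustrative rather than the paper's code, and multi-head splitting, causal masking, and loading of the pretrained projection weights are omitted for brevity.

```python
# Minimal single-head sketch: conditioning states are mapped by new projections
# (u_k, u_v) into the key/value space of the pretrained self-attention
# projections (w_q, w_k, w_v), then prepended to the decoder's own keys/values.
import torch
import torch.nn as nn


class PseudoSelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Pretrained self-attention projections (would be loaded from the LM).
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # Newly learned projections that map the conditioning into the LM's space.
        self.u_k = nn.Linear(d_model, d_model, bias=False)
        self.u_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: conditioning states, shape (batch, src_len, d_model)
        # y: decoder states,      shape (batch, tgt_len, d_model)
        q = self.w_q(y)
        k = torch.cat([self.u_k(x), self.w_k(y)], dim=1)  # condition keys first
        v = torch.cat([self.u_v(x), self.w_v(y)], dim=1)  # condition values first
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


# Usage: a condition of length 12 and a decoder prefix of length 7.
layer = PseudoSelfAttention(d_model=64)
out = layer(torch.randn(2, 12, 64), torch.randn(2, 7, 64))
print(out.shape)  # torch.Size([2, 7, 64])
```

Because only u_k and u_v are new, the pretrained attention parameters start from their original values, which is consistent with the observation above that the model is sensitive to large parameter changes during tuning.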
