All NLP Tasks Are Generation Tasks: A General Pretraining Framework

There have been various types of pretraining architectures, including autoregressive models (e.g., GPT), autoencoding models (e.g., BERT), and encoder-decoder models (e.g., T5). NLP tasks, in turn, differ in nature, with three main categories: classification, unconditional generation, and conditional generation. However, no single pretraining framework performs best on all tasks, which introduces inconvenience for model development and selection. We propose a novel pretraining framework, GLM (General Language Model), to address this challenge. Compared to previous work, our architecture has three major benefits: (1) it performs well on classification, unconditional generation, and conditional generation tasks with a single pretrained model; (2) it outperforms BERT-like models on classification due to improved pretrain-finetune consistency; (3) it naturally handles variable-length blank filling, which is crucial for many downstream tasks. Empirically, GLM substantially outperforms BERT on the SuperGLUE natural language understanding benchmark with the same amount of pre-training data. Moreover, GLM with 1.25× the parameters of BERT-Large achieves the best performance on NLU, conditional generation, and unconditional generation at the same time, demonstrating its generalizability to different downstream tasks.

1 Equal contribution. Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing Academy of Artificial Intelligence, Beijing, China; Massachusetts Institute of Technology, Cambridge, U.S.A.; Recurrent AI, Ltd. Correspondence to: Zhilin Yang <kimi_yang@rcrai.com>, Jie Tang <jietang@tsinghua.edu.cn>. The code and pre-trained models are available at https://github.com/THUDM/GLM.

[Figure: illustration of blank filling on the example sentence "All NLP tasks are generation tasks", with [START] and [END] tokens delimiting the generated span.]
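To make the blank-filling idea above concrete, the following is a minimal sketch, not the authors' implementation: the whitespace tokenization, the placeholder special tokens [MASK], [START], [END], and the helper name make_blank_filling_example are illustrative assumptions. It shows how a sentence with variable-length blanks could be turned into a (context, target) pair for autoregressive generation, and how a classification example could be cast as filling a blank with a label word.

```python
# Minimal sketch of autoregressive blank filling (illustrative, not the paper's code).
# Assumptions: whitespace tokenization and placeholder special tokens
# [MASK] / [START] / [END]; a real implementation would use a subword tokenizer.
from typing import List, Tuple


def make_blank_filling_example(tokens: List[str],
                               spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Return (context, target): spans are blanked out of the context and
    regenerated autoregressively in the target, one [START]...[END] segment per blank."""
    context: List[str] = []
    target: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        context.extend(tokens[cursor:start])
        context.append("[MASK]")              # each blank collapses to a single [MASK]
        target.append("[START]")              # generation of this blank starts here
        target.extend(tokens[start:end])      # the variable-length missing content
        target.append("[END]")                # explicit end of the filled span
        cursor = end
    context.extend(tokens[cursor:])
    return context, target


if __name__ == "__main__":
    # Blank filling over the running example sentence.
    sent = "All NLP tasks are generation tasks".split()
    ctx, tgt = make_blank_filling_example(sent, [(1, 3)])
    print(ctx)  # ['All', '[MASK]', 'are', 'generation', 'tasks']
    print(tgt)  # ['[START]', 'NLP', 'tasks', '[END]']

    # A classification example cast as cloze-style blank filling:
    # the model fills the blank with a label word such as "good" or "bad".
    review = "This movie is great . It was really good".split()
    label_position = len(review) - 1          # blank out the label word "good"
    ctx, tgt = make_blank_filling_example(review, [(label_position, len(review))])
    print(ctx)  # [..., 'It', 'was', 'really', '[MASK]']
    print(tgt)  # ['[START]', 'good', '[END]']
```

Under this framing, classification, conditional generation, and unconditional generation all share the same prediction interface of generating the content of a blank, which is the pretrain-finetune consistency the abstract refers to.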

[1] Stefano Soatto, et al. Structured Prediction as Translation between Augmented Natural Languages, 2021, ICLR.

[2] Hinrich Schütze, et al. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners, 2020, NAACL.

[3] Bing Xiang, et al. Augmented Natural Language for Generative Sequence Labeling, 2020, EMNLP.

[4] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[5] Chris Donahue, et al. Enabling Language Models to Fill in the Blanks, 2020, ACL.

[6] Chenliang Li, et al. PALM: Pre-training an Autoencoding & Autoregressive Language Model for Context-conditioned Generation, 2020, EMNLP.

[7] Jianfeng Gao, et al. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training, 2020, ICML.

[8] Regina Barzilay, et al. Blank Language Models, 2020, EMNLP.

[9] Timo Schick, et al. Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference, 2020, EACL.

[10] Peter J. Liu, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[11] M. Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, arXiv.

[12] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[13] Omer Levy, et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019, TACL.

[14] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[15] Xiaodong Liu, et al. Unified Language Model Pre-training for Natural Language Understanding and Generation, 2019, NeurIPS.

[16] Xu Tan, et al. MASS: Masked Sequence to Sequence Pre-training for Language Generation, 2019, ICML.

[17] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[18] Ming-Wei Chang, et al. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019, NAACL.

[19] José Camacho-Collados, et al. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations, 2018, NAACL.

[20] Quoc V. Le, et al. A Simple Method for Commonsense Reasoning, 2018, arXiv.

[21] Dan Roth, et al. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences, 2018, NAACL.

[22] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.

[23] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.

[24] Geoffrey E. Hinton, et al. Regularizing Neural Networks by Penalizing Confident Output Distributions, 2017, ICLR.

[25] Angeliki Lazaridou, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016, ACL.

[26] Alexander M. Rush, et al. A Neural Attention Model for Abstractive Sentence Summarization, 2015, EMNLP.

[27] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.

[28] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[29] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[30] Zornitsa Kozareva, et al. SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning, 2011, SemEval.

[31] Hector J. Levesque, et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

[32] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[33] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[34] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[35] Ido Dagan, et al. The PASCAL Recognising Textual Entailment Challenge, 2005, MLCW.