Don't Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner

Language models (LMs) trained on vast quantities of unlabelled data have greatly advanced the field of natural language processing (NLP). In this study, we revisit the widely accepted notion in NLP that continued pre-training of LMs on task-related texts improves the performance of fine-tuning (FT) on downstream tasks. Through experiments on eight single-sentence tasks and eight sentence-pair tasks in both semi-supervised and fully-supervised settings, we find that conventional continued pre-training does not consistently provide benefits and can even be detrimental for sentence-pair tasks or when prompt-based FT is used. To address these issues, we propose Prompt-based Continued Pre-training (PCP), which combines the idea of instruction tuning with conventional continued pre-training. Our approach aims to improve the performance of prompt-based FT by presenting both task-related texts and prompt templates to LMs through unsupervised pre-training objectives before fine-tuning on the target task. Our empirical evaluations on 21 benchmarks demonstrate that PCP consistently improves the performance of state-of-the-art prompt-based FT approaches (by up to 20.1% absolute) in both semi-supervised and fully-supervised settings, even with only hundreds of unlabelled examples. Additionally, prompt-based FT with PCP outperforms state-of-the-art semi-supervised approaches while being simpler, eliminating the need for an iterative process and extra data augmentation. Our further analysis explores the performance lower bound of PCP and reveals that its advantages persist across different model and dataset sizes.
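
To make the idea concrete, below is a minimal sketch of what prompt-based continued pre-training could look like in practice: unlabelled task-related texts are wrapped in the same prompt template that will later be used for prompt-based FT, and the LM is continually trained on them with its standard masked-language-modelling objective. This is an illustration under assumptions, not the paper's exact recipe; the `roberta-base` checkpoint, the sentiment-style template, the toy texts, and all hyperparameters are placeholders, and the code assumes the Hugging Face `transformers` and `datasets` libraries.

```python
# Hedged sketch of Prompt-based Continued Pre-training (PCP).
# Assumptions: a RoBERTa-style masked LM, an illustrative sentiment template,
# and toy unlabelled texts standing in for the task-related corpus.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "roberta-base"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabelled task-related texts wrapped in the prompt template that will also
# be used during prompt-based fine-tuning (hypothetical template).
template = "{text} It was {mask}."
unlabelled_texts = [
    "The movie dragged on far too long.",
    "A delightful surprise from start to finish.",
]
prompted = [template.format(text=t, mask=tokenizer.mask_token) for t in unlabelled_texts]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = Dataset.from_dict({"text": prompted}).map(
    tokenize, batched=True, remove_columns=["text"]
)

# Standard MLM objective over the prompt-formatted texts: the model is exposed
# to both the task domain and the prompt template before any labelled FT.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pcp_checkpoint",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
# The resulting checkpoint would then initialise prompt-based fine-tuning on
# the (few) labelled examples of the target task.
```

In this sketch the only difference from conventional continued pre-training is that the unlabelled texts are presented in prompt format, which is the core of what the abstract describes; the downstream prompt-based FT step itself is unchanged.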
