Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (e.g., adapter modules, prompt tuning, or sparse update methods) offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new parameter-efficient fine-tuning method called (IA)³ that scales activations by learned vectors, attaining stronger performance while introducing only a relatively tiny number of new parameters. We also propose a simple recipe based on the T0 model [24] called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark [36], attaining super-human performance for the first time and outperforming the state of the art by 6% absolute. All of the code used in our experiments is publicly available.
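
To make the abstract's description of (IA)³ concrete, the sketch below shows one plausible instantiation of "scaling activations by learned vectors" in a toy Transformer block: the base projections are frozen stand-ins for a pre-trained model's weights, and the only trainable parameters are element-wise scaling vectors. The placement of those vectors (attention keys, attention values, and the feed-forward network's intermediate activations) and the IA3SelfAttention / IA3FeedForward module names are assumptions of this sketch, not the released T-Few implementation.

```python
# Minimal sketch of (IA)^3-style rescaling (illustrative, not the authors' released code).
# The base weights stay frozen; only small scaling vectors are trained.
import torch
import torch.nn as nn


class IA3SelfAttention(nn.Module):
    """Toy self-attention block whose keys and values are rescaled by learned vectors."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Frozen base projections (stand-ins for pre-trained weights).
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        for p in self.parameters():
            p.requires_grad = False
        # The only trainable parameters here: one scaling vector for keys, one for
        # values, initialized to ones so training starts from the unmodified model.
        self.l_k = nn.Parameter(torch.ones(d_model))
        self.l_v = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x)
        k = self.k_proj(x) * self.l_k  # element-wise rescaling of keys
        v = self.v_proj(x) * self.l_v  # element-wise rescaling of values
        shape = (b, t, self.n_heads, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.o_proj(out)


class IA3FeedForward(nn.Module):
    """Toy position-wise FFN whose intermediate activations are rescaled."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)
        for p in self.parameters():
            p.requires_grad = False
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(torch.relu(self.w_in(x)) * self.l_ff)


if __name__ == "__main__":
    x = torch.randn(2, 5, 64)
    block = nn.Sequential(IA3SelfAttention(64, 8), IA3FeedForward(64, 256))
    print(block(x).shape)  # torch.Size([2, 5, 64])
    trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
    print(trainable)  # 64 + 64 + 256 = 384 trainable parameters
```

Because only the scaling vectors receive gradients, the trainable parameter count grows with the model's hidden dimensions rather than with its full weight matrices, which is what keeps this style of fine-tuning parameter-efficient relative to updating the whole model.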

[1] Kuntal Kumar Pal, et al. Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks, 2022, arXiv.

[2] Yang Gao, et al. PSP: Pre-trained Soft Prompts for Few-Shot Abstractive Summarization, 2022, COLING.

[3] James L. McClelland, et al. Can language models learn from explanations in context?, 2022, EMNLP.

[4] Rabeeh Karimi Mahabadi, et al. Prompt-free and Efficient Few-shot Learning with Language Models, 2022, ACL.

[5] Serge J. Belongie, et al. Visual Prompt Tuning, 2022, ECCV.

[6] Juan Cao, et al. A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement, 2022, arXiv.

[7] Zonghan Yang, et al. On Robust Prefix-Tuning for Text Classification, 2022, ICLR.

[8] Huan Sun, et al. Shepherd Pre-trained Language Models to Develop a Train of Thought: An Iterative Prompting Approach, 2022, arXiv.

[9] Weizhu Chen, et al. Input-Tuning: Adapting Unfamiliar Inputs to Frozen Pretrained Models, 2022, arXiv.

[10] Yue Zhang, et al. Do Prompts Solve NLP Tasks Using Natural Language?, 2022, arXiv.

[11] Ed H. Chi, et al. HyperPrompt: Prompt-based Task-Conditioning of Transformers, 2022, ICML.

[12] M. Lewis, et al. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, 2022, EMNLP.

[13] Orhan Firat, et al. Using natural language prompts for machine translation, 2022, arXiv.

[14] AdaPrompt: Adaptive Model Training for Prompt-based NLP, 2022, arXiv:2202.04824.

[15] Alexander M. Rush, et al. PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts, 2022, ACL.

[16] D. Sontag, et al. Co-training Improves Prompt-based Learning for Large Language Models, 2022, ICML.

[17] Shizhe Diao, et al. Black-box Prompt Learning for Pre-trained Language Models, 2022.

[18] Jennifer G. Dy, et al. Learning to Prompt for Continual Learning, 2021, CVPR 2022.

[19] Yejin Choi, et al. Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts, 2021, NAACL.

[20] Timo Schick, et al. True Few-Shot Learning with Prompts—A Real-World Perspective, 2021, TACL.

[21] M. Lewis, et al. MetaICL: Learning to Learn In Context, 2021, NAACL.

[22] Yi Tay, et al. The Efficiency Misnomer, 2021, ICLR.

[23] Brian Lester, et al. SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer, 2021, ACL.

[24] Alexander M. Rush, et al. Multitask Prompted Training Enables Zero-Shot Task Generalization, 2021, ICLR.

[25] G. Karypis, et al. Meta-learning via Language Model In-context Tuning, 2021, ACL.

[26] Graham Neubig, et al. Towards a Unified View of Parameter-Efficient Transfer Learning, 2021, ICLR.

[27] Minlie Huang, et al. PPT: Pre-trained Prompt Tuning for Few-shot Learning, 2021, ACL.

[28] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.

[29] Ellie Pavlick, et al. Do Prompt-Based Models Really Understand the Meaning of Their Prompts?, 2021, NAACL.

[30] Fei Huang, et al. Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners, 2021, ICLR.

[31] Luke Zettlemoyer, et al. Noisy Channel Language Model Prompting for Few-Shot Text Classification, 2021, ACL.

[32] Yoav Goldberg, et al. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models, 2021, ACL.

[33] Yelong Shen, et al. LoRA: Low-Rank Adaptation of Large Language Models, 2021, ICLR.

[34] Colin Raffel, et al. Training Neural Networks with Fixed Sparse Masks, 2021, NeurIPS.

[35] Zhilin Yang, et al. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks, 2021, arXiv.

[36] Andreas Stuhlmüller, et al. RAFT: A Real-World Few-Shot Text Classification Benchmark, 2021, NeurIPS Datasets and Benchmarks.

[37] Douwe Kiela, et al. True Few-Shot Learning with Language Models, 2021, NeurIPS.

[38] R. Zemel, et al. Learning a Universal Template for Few-shot Dataset Generalization, 2021, ICML.

[39] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.

[40] Guanghui Qin, et al. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts, 2021, NAACL.

[41] Hakan Bilen, et al. Universal Representation Learning from Multiple Domains for Few-shot Classification, 2021, ICCV.

[42] Colin Raffel, et al. Improving and Simplifying Pattern Exploiting Training, 2021, EMNLP.

[43] Zhilin Yang, et al. Controllable Generation from Pre-trained Language Models via Inverse Prompting, 2021, KDD.

[44] Alexander M. Rush, et al. How many data points is a prompt worth?, 2021, NAACL.

[45] D. Klein, et al. Calibrate Before Use: Improving Few-Shot Performance of Language Models, 2021, ICML.

[46] Danqi Chen, et al. Making Pre-trained Language Models Better Few-shot Learners, 2021, ACL.

[47] Armen Aghajanyan, et al. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, 2020, ACL.

[48] Alexander M. Rush, et al. Parameter-Efficient Transfer Learning with Diff Pruning, 2020, ACL.

[49] Timo Schick, et al. Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference, 2020, EACL.

[50] Joe Davison, et al. Compacter: Efficient Low-Rank Hypercomplex Adapter Layers, 2021, NeurIPS.

[51] Maosong Sun, et al. On Transferability of Prompt Tuning for Natural Language Understanding, 2021, arXiv.

[52] Percy Liang, et al. Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, ACL.

[53] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[54] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, arXiv.

[55] J. Weston, et al. Adversarial NLI: A New Benchmark for Natural Language Understanding, 2019, ACL.

[56] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[57] Jason Weston, et al. Neural Text Generation with Unlikelihood Training, 2019, ICLR.

[58] Ronan Le Bras, et al. WinoGrande, 2019, AAAI.

[59] Ankur Bapna, et al. Simple, Scalable Adaptation for Neural Machine Translation, 2019, EMNLP.

[60] Judith Tonhauser, et al. The CommitmentBank: Investigating projection in naturally occurring discourse, 2019.

[61] Sebastian Nowozin, et al. Fast and Flexible Multi-Task Classification Using Conditional Neural Adaptive Processes, 2019, NeurIPS.

[62] Ali Farhadi, et al. HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, ACL.

[63] Mona Attariyan, et al. Parameter-Efficient Transfer Learning for NLP, 2019, ICML.

[64] José Camacho-Collados, et al. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations, 2018, NAACL.

[65] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[66] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[67] James Allen, et al. Tackling the Story Ending Biases in The Story Cloze Test, 2018, ACL.

[68] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.

[69] Colin Raffel, et al. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms, 2018, NeurIPS.

[70] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[71] Andrea Vedaldi, et al. Learning multiple visual domains with residual adapters, 2017, NIPS.

[72] Zornitsa Kozareva, et al. SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning, 2011, *SEMEVAL.

[73] Hector J. Levesque, et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

[74] Gaël Varoquaux, et al. The NumPy Array: A Structure for Efficient Numerical Computation, 2011, Computing in Science & Engineering.

[75] Ido Dagan, et al. The Third PASCAL Recognizing Textual Entailment Challenge, 2007, ACL-PASCAL@ACL.