AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models

Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters. This not only increases the serving cost of storing a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation. Parameter-efficient techniques have been developed that tune small trainable components (e.g., adapters) injected into the large model while keeping most of the model weights frozen. The prevalent mechanism for increasing adapter capacity is to enlarge the bottleneck dimension, which also increases the number of adapter parameters. In this work, we introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost, based on two key techniques. (i) We introduce multiple shared adapter components in each layer of the Transformer architecture and leverage sparse learning via random routing to update the adapter parameters (the encoder is kept frozen), resulting in the same computational cost (FLOPs) as training a single adapter. (ii) We propose a simple merging mechanism that averages the weights of the multiple adapter components to collapse them into a single adapter in each Transformer layer, thereby keeping the overall parameter count the same while yielding significant performance improvements. We demonstrate that these techniques work well across multiple task settings, including fully supervised and few-shot Natural Language Understanding tasks. By tuning only 0.23% of a pre-trained language model's parameters, our model outperforms full model fine-tuning and several competing methods.

The performance of each model is reported after a fixed number of training epochs. For a fair comparison, we use the same set of few-shot labeled instances for training as in Wang et al. (2021). We train each model with 5 different seeds and report the average performance with standard deviation across runs. In the few-shot experiments, we follow Wang et al. (2021) and train AdaMix via the prompt-based fine-tuning strategy. In contrast to Wang et al. (2021), we do not use any unlabeled data.
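
The two mechanisms above, stochastic routing over multiple adapter components during training and weight averaging that collapses them into a single adapter for inference, can be illustrated with a short sketch. The PyTorch module below is a minimal illustration, not the authors' implementation: the names BottleneckAdapter, MixtureOfAdapters, num_adapters, and bottleneck_dim are our own assumptions, and the adapter itself follows the standard bottleneck design of [25].

```python
# Minimal sketch (not the authors' code) of a mixture of adapters with stochastic
# routing during training and weight averaging (merging) at inference time.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class MixtureOfAdapters(nn.Module):
    """Several adapter components per layer: one is chosen at random per forward pass
    during training (same FLOPs as a single adapter), and all are averaged into a
    single adapter for inference (same parameter count as a single adapter)."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int, num_adapters: int = 4):
        super().__init__()
        self.adapters = nn.ModuleList(
            BottleneckAdapter(hidden_dim, bottleneck_dim) for _ in range(num_adapters)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Stochastic routing: activate exactly one randomly chosen component,
            # keeping the training compute equal to that of a single adapter.
            idx = int(torch.randint(len(self.adapters), (1,)).item())
            return self.adapters[idx](x)
        # At inference, use the merged (averaged) adapter; in practice the merge
        # would be computed once after training rather than on every call.
        return self.merge()(x)

    def merge(self) -> BottleneckAdapter:
        """Collapse the mixture into one adapter by averaging component weights."""
        hidden_dim = self.adapters[0].down.in_features
        bottleneck_dim = self.adapters[0].down.out_features
        merged = BottleneckAdapter(hidden_dim, bottleneck_dim)
        merged.to(self.adapters[0].down.weight.device)
        with torch.no_grad():
            for name, param in merged.named_parameters():
                param.copy_(
                    torch.stack(
                        [dict(a.named_parameters())[name] for a in self.adapters]
                    ).mean(dim=0)
                )
        return merged
```

The averaging step is what keeps the inference-time parameter count identical to that of a single adapter per layer, in the spirit of weight-averaging methods such as stochastic weight averaging [27] and model soups [1].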

[1] Ari S. Morcos, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022, ICML.

[2] Colin Raffel, et al. Merging Models with Fisher-Weighted Averaging, 2021, NeurIPS.

[3] Amjad Almahairi, et al. UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning, 2021, ACL.

[4] T. Zhao, et al. Taming Sparsely Activated Transformer with Stochastic Experts, 2021, ICLR.

[5] Yoav Goldberg, et al. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models, 2021, ACL.

[6] Yelong Shen, et al. LoRA: Low-Rank Adaptation of Large Language Models, 2021, ICLR.

[7] Jason Weston, et al. Hash Layers For Large Sparse Models, 2021, NeurIPS.

[8] Xianyan Jia, et al. M6-T: Exploring Sparse Expert Models and Beyond, 2021.

[9] Douwe Kiela, et al. True Few-Shot Learning with Language Models, 2021, NeurIPS.

[10] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.

[11] Naman Goyal, et al. BASE Layers: Simplifying Training of Large, Sparse Models, 2021, ICML.

[12] Noam M. Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, J. Mach. Learn. Res.

[13] Danqi Chen, et al. Making Pre-trained Language Models Better Few-shot Learners, 2021, ACL.

[14] Armen Aghajanyan, et al. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, 2020, ACL.

[15] Behnam Neyshabur, et al. What is being transferred in transfer learning?, 2020, NeurIPS.

[16] Iryna Gurevych, et al. AdapterHub: A Framework for Adapting Transformers, 2020, EMNLP.

[17] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.

[18] Kilian Q. Weinberger, et al. Revisiting Few-sample BERT Fine-tuning, 2020, ICLR.

[19] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[20] Iryna Gurevych, et al. AdapterFusion: Non-Destructive Task Composition for Transfer Learning, 2020, EACL.

[21] Daniel M. Roy, et al. Linear Mode Connectivity and the Lottery Ticket Hypothesis, 2019, ICML.

[22] Jimmy J. Lin, et al. What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning, 2019, ArXiv.

[23] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[24] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[25] Mona Attariyan, et al. Parameter-Efficient Transfer Learning for NLP, 2019, ICML.

[26] Samuel R. Bowman, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[27] Andrew Gordon Wilson, et al. Averaging Weights Leads to Wider Optima and Better Generalization, 2018, UAI.

[28] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[29] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[30] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.

[31] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, CVPR.

[32] Zoubin Ghahramani, et al. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2015, ICML.

[33] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.

[34] Ido Dagan, et al. The Third PASCAL Recognizing Textual Entailment Challenge, 2007, ACL-PASCAL@ACL.

[35] Bo Pang, et al. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004, ACL.

[36] Jianfeng Gao, et al. LiST: Lite Self-training Makes Efficient Few-shot Learners, 2021, ArXiv.

[37] Percy Liang, et al. Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, ACL.

[38] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[39] Ido Dagan, et al. The Sixth PASCAL Recognizing Textual Entailment Challenge, 2009, TAC.

[40] Roy Bar-Haim, et al. The Second PASCAL Recognising Textual Entailment Challenge, 2006.

[41] Claire Cardie, et al. Annotating Expressions of Opinions and Emotions in Language, 2005, Lang. Resour. Evaluation.