On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation

Adapter-based tuning has recently emerged as an alternative to fine-tuning. It works by adding light-weight adapter modules to a pretrained language model (PrLM) and updating only the parameters of the adapter modules when learning a downstream task. As such, it adds only a few trainable parameters per new task, allowing a high degree of parameter sharing. Prior studies have shown that adapter-based tuning often achieves results comparable to fine-tuning. However, existing work focuses only on the parameter-efficiency of adapter-based tuning and lacks further investigation of its effectiveness. In this paper, we study the latter. We first show that adapter-based tuning better mitigates forgetting than fine-tuning, since it yields representations that deviate less from those produced by the initial PrLM. We then empirically compare the two tuning methods on several downstream NLP tasks and settings. We demonstrate that 1) adapter-based tuning outperforms fine-tuning on low-resource and cross-lingual tasks; and 2) it is more robust to overfitting and less sensitive to changes in learning rates.
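
As a rough illustration of the setup described above, the following PyTorch sketch shows a bottleneck-style adapter module (in the spirit of Houlsby et al., 2019) and how all non-adapter parameters of the PrLM can be frozen so that only adapter parameters are updated. This is a minimal sketch under those assumptions: the names BottleneckAdapter and freeze_all_but_adapters, the bottleneck size, and the "adapter" naming convention for parameter matching are illustrative, not taken from the paper.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection.

    Hypothetical module for illustration; dimensions are placeholders.
    """
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Initialize the up-projection near zero so the adapter starts close to
        # an identity mapping and the PrLM's initial representations are preserved.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def freeze_all_but_adapters(model: nn.Module) -> None:
    """Freeze every PrLM parameter; leave only adapter parameters trainable.

    Assumes adapter submodules are registered under attribute names that
    contain the substring "adapter" (an illustrative convention).
    """
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

In practice, such modules are typically inserted after Transformer sublayers of every layer, so only a small fraction of the model's parameters is updated per task while the original PrLM weights stay fixed.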
