Task-Specific Skill Localization in Fine-tuned Language Models

Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-shot settings. Fine-tuning thus allows the model to quickly pick up task-specific ``skills,'' but there has been limited study of where these newly-learnt skills reside inside the massive model. This paper introduces the term skill localization for this problem and proposes a solution. Given a downstream task and a model fine-tuned on that task, a simple optimization is used to identify a very small subset of parameters ($\sim0.01$% of model parameters) responsible for over $95$% of the model's performance, in the sense that grafting the fine-tuned values of just this tiny subset onto the pre-trained model yields performance almost matching that of the fine-tuned model. While reminiscent of recent work on parameter-efficient fine-tuning, the novel aspects here are that: (i) no further re-training is needed on the subset (unlike, say, with lottery tickets); (ii) notable improvements over vanilla fine-tuning are observed in the calibration of in-distribution predictions ($40$-$90$% error reduction) as well as in the quality of out-of-distribution (OOD) predictions. In models trained on multiple tasks, a stronger notion of skill localization is observed, where the sparse regions corresponding to different tasks are almost disjoint, and their overlap (when it happens) is a proxy for task similarity. Experiments suggest that localization via grafting can assist certain forms of continual learning.
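To make the grafting operation concrete, below is a minimal PyTorch sketch that copies the fine-tuned values of a sparse subset of parameters onto the pre-trained weights and leaves everything else untouched. The subset here is chosen by a simple top-$k$ heuristic on parameter movement $|\theta_{\text{ft}} - \theta_{\text{pre}}|$, which is a hypothetical stand-in for the mask optimization described in the abstract; the `graft` function and its `sparsity` argument are illustrative, not the paper's exact procedure.

```python
import torch

def graft(pretrained_state, finetuned_state, sparsity=1e-4):
    """Return a state dict equal to the pre-trained weights, except on a sparse mask
    where the fine-tuned values are grafted in.

    sparsity: fraction of parameters to graft (e.g. 1e-4, i.e. ~0.01% of the model).
    """
    # Per-parameter movement; only floating-point tensors are candidates for grafting.
    deltas = {
        name: (finetuned_state[name] - pre).abs()
        for name, pre in pretrained_state.items() if pre.is_floating_point()
    }

    # Pick a single global threshold so that roughly `sparsity` of all weights are grafted.
    all_deltas = torch.cat([d.flatten() for d in deltas.values()])
    k = max(1, int(sparsity * all_deltas.numel()))
    threshold = torch.topk(all_deltas, k).values.min()

    grafted = {}
    for name, pre in pretrained_state.items():
        if name not in deltas:
            grafted[name] = pre                    # e.g. integer buffers stay pre-trained
            continue
        mask = deltas[name] >= threshold           # sparse binary mask for this tensor
        grafted[name] = torch.where(mask, finetuned_state[name], pre)
    return grafted
```

No re-training is involved: the grafted state dict can be loaded directly into the pre-trained architecture (e.g. via `model.load_state_dict(grafted)`) and evaluated on the downstream task as-is.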
