Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning

Recent pretrained language models have grown from millions to billions of parameters, so the need to fine-tune an extremely large pretrained model with a limited training corpus arises in many downstream tasks. In this paper, we propose a straightforward yet effective fine-tuning technique, CHILD-TUNING, which updates only a subset of parameters (the child network) of a large pretrained model by strategically masking out the gradients of the non-child network during the backward pass. Experiments on various downstream tasks in the GLUE benchmark show that CHILD-TUNING consistently outperforms vanilla fine-tuning by 1.5∼8.6 points in average score across four different pretrained models, and surpasses prior fine-tuning techniques by 0.6∼1.3 points. Furthermore, empirical results on domain transfer and task transfer show that CHILD-TUNING obtains better generalization performance by large margins.
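
To make the gradient-masking idea concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation. The function name child_tuning_step and the child_masks dictionary (mapping each parameter name to a 0/1 tensor that marks the child network) are illustrative assumptions; how the mask is chosen, e.g. at random or by some parameter-importance score, is left to the caller.

```python
import torch

def child_tuning_step(model, loss, child_masks, optimizer):
    """One fine-tuning step that updates only the child network.

    child_masks: dict mapping parameter name -> 0/1 tensor of the same
    shape as the parameter, where 1 marks entries belonging to the
    child network (hypothetical helper structure for illustration).
    """
    optimizer.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is not None and name in child_masks:
            # Zero the gradients of non-child parameters so the
            # optimizer step leaves them unchanged.
            param.grad.mul_(child_masks[name].to(param.grad.dtype))
    optimizer.step()
```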
