Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning

Recent pretrained language models have grown from millions to billions of parameters, so the need to fine-tune an extremely large pretrained model with a limited training corpus arises in many downstream tasks. In this paper, we propose a straightforward yet effective fine-tuning technique, CHILD-TUNING, which updates only a subset of parameters (the child network) of a large pretrained model by strategically masking out the gradients of the non-child network during the backward pass. Experiments on various downstream tasks in the GLUE benchmark show that CHILD-TUNING consistently outperforms vanilla fine-tuning by 1.5∼8.6 points in average score across four different pretrained models, and surpasses prior fine-tuning techniques by 0.6∼1.3 points. Furthermore, empirical results on domain transfer and task transfer show that CHILD-TUNING obtains better generalization performance by large margins.
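
To make the gradient-masking idea concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation. The function name child_tuning_step and the child_masks dictionary (mapping each parameter name to a 0/1 tensor that marks the child network) are illustrative assumptions; how the mask is chosen, e.g. at random or by some parameter-importance score, is left to the caller.

```python
import torch

def child_tuning_step(model, loss, child_masks, optimizer):
    """One fine-tuning step that updates only the child network.

    child_masks: dict mapping parameter name -> 0/1 tensor of the same
    shape as the parameter, where 1 marks entries belonging to the
    child network (hypothetical helper structure for illustration).
    """
    optimizer.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is not None and name in child_masks:
            # Zero the gradients of non-child parameters so the
            # optimizer step leaves them unchanged.
            param.grad.mul_(child_masks[name].to(param.grad.dtype))
    optimizer.step()
```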
