How fine can fine-tuning be? Learning efficient language models

State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task: the number of fine-tuning steps is typically five orders of magnitude lower than the total parameter count. Does this mean that fine-tuning introduces only small differences from the pre-trained model in parameter space? If so, can one avoid storing and computing an entire model for each task? In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. As expected, we find that the fine-tuned models are close in parameter space to the pre-trained one, with the closeness varying from layer to layer. We show that it suffices to fine-tune only the most critical layers. Further, we find that there are surprisingly many good solutions in the set of sparsified versions of the pre-trained model. As a result, fine-tuning of huge language models can be achieved by simply setting a certain number of entries in certain layers of the pre-trained parameters to zero, saving both task-specific parameter storage and computational cost.
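The closing claim amounts to a per-task sparsification recipe: keep one dense pre-trained model shared across all tasks and store only a binary mask per task that marks which entries are zeroed. Below is a minimal sketch of that idea, assuming PyTorch and the HuggingFace `transformers` BERT checkpoint; the magnitude-based masking criterion and the choice to mask only the top encoder layer are illustrative assumptions, not the paper's exact procedure (the paper selects which entries to zero via supervised fine-tuning).

```python
# Minimal sketch (illustrative assumptions, not the authors' exact procedure):
# share one dense pre-trained BERT across tasks and store only a small binary
# mask per task, obtained here by zeroing low-magnitude entries of selected layers.
import torch
from transformers import BertModel  # assumed dependency for loading pre-trained BERT


def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask that zeroes the `sparsity` fraction of smallest-magnitude entries."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold


model = BertModel.from_pretrained("bert-base-uncased")

# Per-task storage is just the boolean masks (one bit per masked entry);
# the dense pre-trained weights stay untouched and shared across tasks.
task_mask = {}
with torch.no_grad():
    for name, param in model.named_parameters():
        # Illustrative layer choice: sparsify only the top encoder layer's weights.
        if "encoder.layer.11" in name and name.endswith("weight"):
            mask = magnitude_mask(param, sparsity=0.1)
            param.mul_(mask.to(param.dtype))  # set the selected entries to zero
            task_mask[name] = mask
```

Under this scheme, only `task_mask` would need to be saved per task, which is far smaller than a full fine-tuned copy of the model; the paper's finding is that such sparsified versions of the pre-trained weights already contain surprisingly many good task-specific solutions.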
