Movement Pruning: Adaptive Sparsity by Fine-Tuning

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning yields significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss while retaining only 3% of the model parameters.
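
The abstract describes movement pruning only at a high level. To make the mechanism concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of a linear layer pruned by learned importance scores, using a hard top-k mask with a straight-through estimator. The names `TopKBinarizer`, `MovementPrunedLinear`, and the `keep_ratio` parameter are illustrative assumptions introduced for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKBinarizer(torch.autograd.Function):
    """Binarize scores: keep the top keep_ratio fraction, zero the rest."""

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        mask = torch.zeros_like(scores)
        k = max(1, int(keep_ratio * scores.numel()))
        _, idx = torch.topk(scores.flatten(), k)
        mask.view(-1)[idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient to the scores unchanged.
        return grad_output, None


class MovementPrunedLinear(nn.Module):
    """Linear layer whose weights are masked by jointly learned importance scores."""

    def __init__(self, in_features, out_features, keep_ratio=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight)
        # Importance scores S, trained together with the weights during fine-tuning
        # (zero init is arbitrary here; a small constant or random init also works).
        self.scores = nn.Parameter(torch.zeros_like(self.weight))
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKBinarizer.apply(self.scores, self.keep_ratio)
        # With the straight-through estimator, dL/dS_ij = (dL/dW'_ij) * W_ij,
        # where W' is the masked weight: scores grow for weights moving away from zero.
        return F.linear(x, self.weight * mask, self.bias)
```

The key design choice this sketch illustrates: because the score gradient is the product of the weight and its loss gradient, weights that consistently move away from zero during fine-tuning accumulate high scores and survive pruning, whereas their magnitude inherited from pretraining matters far less than under magnitude pruning.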
