Movement Pruning: Adaptive Sparsity by Fine-Tuning

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning yields significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss while retaining only 3% of the model parameters.
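
The abstract describes movement pruning only at a high level. To make the mechanism concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of a linear layer pruned by learned importance scores, using a hard top-k mask with a straight-through estimator. The names `TopKBinarizer`, `MovementPrunedLinear`, and the `keep_ratio` parameter are illustrative assumptions introduced for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKBinarizer(torch.autograd.Function):
    """Binarize scores: keep the top keep_ratio fraction, zero the rest."""

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        mask = torch.zeros_like(scores)
        k = max(1, int(keep_ratio * scores.numel()))
        _, idx = torch.topk(scores.flatten(), k)
        mask.view(-1)[idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient to the scores unchanged.
        return grad_output, None


class MovementPrunedLinear(nn.Module):
    """Linear layer whose weights are masked by jointly learned importance scores."""

    def __init__(self, in_features, out_features, keep_ratio=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight)
        # Importance scores S, trained together with the weights during fine-tuning
        # (zero init is arbitrary here; a small constant or random init also works).
        self.scores = nn.Parameter(torch.zeros_like(self.weight))
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKBinarizer.apply(self.scores, self.keep_ratio)
        # With the straight-through estimator, dL/dS_ij = (dL/dW'_ij) * W_ij,
        # where W' is the masked weight: scores grow for weights moving away from zero.
        return F.linear(x, self.weight * mask, self.bias)
```

The key design choice this sketch illustrates: because the score gradient is the product of the weight and its loss gradient, weights that consistently move away from zero during fine-tuning accumulate high scores and survive pruning, whereas their magnitude inherited from pretraining matters far less than under magnitude pruning.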
