Only Train Once: A One-Shot Neural Network Training And Pruning Framework

Structured pruning is a commonly used technique for deploying deep neural networks (DNNs) on resource-constrained devices. However, existing pruning methods are usually heuristic, task-specific, and require an extra fine-tuning procedure. To overcome these limitations, we propose Only-Train-Once (OTO), a framework that compresses full DNNs into slimmer architectures with competitive performance and significant FLOPs reduction. OTO has two key components: (i) we partition the parameters of a DNN into zero-invariant groups, which allows us to prune zero groups without affecting the output; and (ii) to promote zero groups, we formulate a structured-sparsity optimization problem and propose a novel optimization algorithm, Half-Space Stochastic Projected Gradient (HSPG), to solve it, which outperforms standard proximal methods at group-sparsity exploration while maintaining comparable convergence. To demonstrate the effectiveness of OTO, we train and compress full models simultaneously from scratch, without fine-tuning, for inference speedup and parameter reduction, achieving state-of-the-art results with VGG16 and ResNet50 on CIFAR10 and BERT on SQuAD, and a competitive result with ResNet50 on ImageNet. The source code is available at https://github.com/tianyic/only_train_once.
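For intuition, the sketch below illustrates the two ingredients above in PyTorch: a group-lasso-style regularized objective over zero-invariant groups, and the half-space projection step that HSPG uses to map redundant groups exactly to zero. This is a minimal sketch, not the authors' implementation (see the linked repository); the one-tensor-per-group layout, the function names, and the eps threshold are illustrative assumptions.

```python
import torch

def group_sparse_objective(loss, zig_params, lam=1e-3):
    # Structured-sparsity objective: task loss plus a mixed l2/l1 (group-lasso)
    # penalty, where each tensor in `zig_params` stands for one zero-invariant group.
    reg = sum(g.norm(p=2) for g in zig_params)
    return loss + lam * reg

@torch.no_grad()
def half_space_projection(zig_params, trial_params, eps=0.0):
    # Sketch of a half-space projection step: after a stochastic gradient step
    # produces a trial point x_trial for each group, the group is set exactly to
    # zero whenever x_trial falls outside the half-space
    # {y : <y, x_g> >= eps * ||x_g||^2}; otherwise the trial point is accepted.
    # Groups driven to zero are the ones that can later be pruned without
    # changing the network output.
    for x_g, x_trial in zip(zig_params, trial_params):
        if torch.dot(x_trial.flatten(), x_g.flatten()) < eps * x_g.norm() ** 2:
            x_g.zero_()
        else:
            x_g.copy_(x_trial)
```

In contrast to a proximal (soft-thresholding) step, the half-space test only zeroes a group once the trial point no longer correlates with the group's current direction, which is intended to yield more zero groups while keeping convergence comparable.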
