Full Deep Neural Network Training On A Pruned Weight Budget

We introduce a DNN training technique that learns only a fraction of the full parameter set without incurring an accuracy penalty. To do this, our algorithm constrains the total number of weights updated during backpropagation to those with the highest total gradients. The remaining weights are not tracked, and their initial values are regenerated at every access to avoid storing them in memory. This can dramatically reduce the number of off-chip memory accesses during both training and inference, a key contributor to the energy consumption of DNN accelerators. By ensuring that the total weight diffusion remains close to that of baseline unpruned SGD, networks pruned using our technique retain state-of-the-art accuracy across network architectures -- including networks previously identified as difficult to compress, such as DenseNet and WRN. With ResNet18 on ImageNet, we observe an 11.7$\times$ weight reduction with no accuracy loss, and up to 24.4$\times$ with a small accuracy impact.
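The sketch below illustrates the core idea in a hedged, toy form: only the weights with the largest accumulated gradient magnitudes get storage and updates, while the remaining weights are regenerated from their seeded initialization at every access. It is a minimal NumPy mock-up under assumed names (`budget_fraction`, `regenerate_init`, `materialize_weights`), not the authors' implementation, and it uses random stand-in gradients in place of real backpropagation.

```python
# Minimal sketch of training on a pruned weight budget (illustrative, not the
# paper's implementation). Only the top-k weights by accumulated gradient
# magnitude are stored and updated; all other weights are regenerated from a
# seed at every access instead of being kept in memory.
import numpy as np

rng_seed = 0              # seed from which untracked weights can be regenerated
n_weights = 10_000        # one flattened layer, for illustration
budget_fraction = 0.1     # fraction of weights we are allowed to learn (assumed)
lr = 0.01

def regenerate_init(seed, n):
    """Recreate the initial weights from the seed instead of storing them."""
    return np.random.default_rng(seed).normal(0.0, 0.05, size=n).astype(np.float32)

# Warm-up: accumulate gradient magnitudes for a few steps to choose the budget.
grad_accum = np.zeros(n_weights, dtype=np.float32)
for _ in range(5):
    grad = np.random.randn(n_weights).astype(np.float32)  # stand-in for a backprop gradient
    grad_accum += np.abs(grad)

# Keep only the indices with the highest total gradients; only these get storage.
k = int(budget_fraction * n_weights)
tracked_idx = np.argsort(grad_accum)[-k:]
tracked_vals = regenerate_init(rng_seed, n_weights)[tracked_idx].copy()

def materialize_weights():
    """Rebuild the full weight vector on the fly: untracked weights come from
    the seed, tracked weights come from the small learned buffer."""
    w = regenerate_init(rng_seed, n_weights)
    w[tracked_idx] = tracked_vals
    return w

# Training loop: gradients are computed on the full materialized weights,
# but only the tracked subset is ever updated or written back.
for _ in range(100):
    w = materialize_weights()
    grad = np.random.randn(n_weights).astype(np.float32)  # stand-in gradient w.r.t. w
    tracked_vals -= lr * grad[tracked_idx]
```

In a real accelerator setting, `regenerate_init` would correspond to an on-chip pseudo-random generator, so the untracked weights never leave the chip; the off-chip traffic is limited to the small tracked buffer.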
