Neural gradients are near-lognormal: improved quantized and sparse training

While training can mostly be accelerated by reducing the time needed to propagate neural gradients back throughout the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. In contrast to weights and activations, we find that the distribution of neural gradients is approximately lognormal. Based on this observation, we suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients. The first method optimizes the floating-point format and scale of the gradients. The second method accurately sets sparsity thresholds for gradient pruning. Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity, in each case without accuracy degradation. A reference implementation accompanies the paper.
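The sketch below illustrates, under stated assumptions, the kind of analysis the abstract describes: fit a normal distribution to log|g| to check that gradient magnitudes are near-lognormal, then read a pruning threshold for a target sparsity off the fitted quantile. It is not taken from the paper's reference implementation; the function names (lognormal_fit, sparsity_threshold), the synthetic gradient tensor, and the 85% target are illustrative assumptions.

# Minimal sketch (assumptions noted above), not the paper's exact procedure.
import numpy as np
from scipy import stats

def lognormal_fit(grads, eps=1e-12):
    """Fit a normal distribution to log|g| and report a KS goodness-of-fit statistic."""
    log_mag = np.log(np.abs(grads.ravel()) + eps)   # eps avoids log(0) for exact zeros
    mu, sigma = log_mag.mean(), log_mag.std()
    # Kolmogorov-Smirnov statistic of log|g| against N(mu, sigma):
    # small values indicate the lognormal model describes the gradients well.
    ks_stat, _ = stats.kstest(log_mag, 'norm', args=(mu, sigma))
    return mu, sigma, ks_stat

def sparsity_threshold(mu, sigma, target_sparsity=0.85):
    """Closed-form threshold: the target-sparsity quantile of the fitted lognormal."""
    return np.exp(mu + sigma * stats.norm.ppf(target_sparsity))

# Usage sketch on synthetic lognormal gradients: prune ~85% of the smallest magnitudes.
grads = np.random.lognormal(mean=-9.0, sigma=2.0, size=10_000) * np.random.choice([-1, 1], 10_000)
mu, sigma, ks = lognormal_fit(grads)
tau = sparsity_threshold(mu, sigma, 0.85)
pruned = np.where(np.abs(grads) < tau, 0.0, grads)
print(f"KS={ks:.3f}, threshold={tau:.3e}, sparsity={np.mean(pruned == 0):.2%}")

Because the quantile is available in closed form from the fitted (mu, sigma), the threshold can be recomputed per layer and per step without sorting the gradient tensor, which is the practical appeal of a lognormal model for pruning.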
