Error-aware Quantization through Noise Tempering

Quantization has become a predominant approach to model compression, enabling deployment of large models trained on GPUs onto smaller form-factor devices for inference. Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error, leading to better performance than post-training quantization. Gradients through the non-differentiable quantization operator are typically approximated using the straight-through estimator (STE) or additive noise. However, STE-based methods suffer from instability due to biased gradients, whereas existing noise-based methods cannot reduce the resulting variance. In this work, we incorporate exponentially decaying, quantization-error-aware noise together with a learnable scale of the task loss gradient to approximate the effect of a quantization operator. We show that this method combines the gradient scale and quantization noise in a better-optimized way, providing finer-grained gradient estimates at the quantizer bin size of each weight and activation layer. Our controlled noise also contains an implicit curvature term that can encourage flatter minima, which our experiments confirm. Experiments training ResNet architectures on the CIFAR-10, CIFAR-100, and ImageNet benchmarks show that our method achieves state-of-the-art top-1 classification accuracy for uniform (non-mixed-precision) quantization, outperforming previous methods by 0.5-1.2% absolute.
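The mechanism described above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering, not the authors' implementation: it perturbs weights with noise whose per-element magnitude matches the quantization error and anneals that noise exponentially over training. The quantizer, the decay schedule, and all names (uniform_quantize, noise_tempered_weights, decay_rate, noise_scale) are assumptions made for illustration.

import math
import torch

def uniform_quantize(w, step, num_bits=4):
    # Uniform symmetric quantization: round to the nearest bin of size `step`,
    # then clamp to the representable signed integer range for `num_bits`.
    qmax = 2 ** (num_bits - 1) - 1
    w_int = torch.clamp(torch.round(w / step), -qmax - 1, qmax)
    return w_int * step

def noise_tempered_weights(w, step, global_step, num_bits=4,
                           decay_rate=1e-4, noise_scale=1.0):
    # Forward pass sees quantized weights plus random noise matched to the
    # per-element quantization error; the random component decays exponentially
    # with the training step, so the forward pass gradually approaches hard
    # quantization. Backward pass treats the perturbation as a constant, so
    # gradients flow straight through to `w`.
    w_q = uniform_quantize(w, step, num_bits)
    q_err = w_q - w                                   # true quantization error
    # random perturbation with the same per-element magnitude as the error
    noise = q_err.abs() * torch.empty_like(w).uniform_(-1.0, 1.0)
    alpha = noise_scale * math.exp(-decay_rate * global_step)
    return w + (q_err + alpha * noise).detach()

For clarity, this sketch keeps the quantizer step size fixed and omits the learnable scaling of the task loss gradient mentioned in the abstract; in the full method those quantities are optimized jointly with the network.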
