Pareto-Optimal Quantized ResNet Is Mostly 4-bit

Quantization has become a popular technique to compress neural networks and reduce compute cost, but most prior work studies quantization without changing the network size. Many real-world applications of neural networks have compute cost and memory budgets, which can be traded off against model quality by changing the number of parameters. In this work, we use ResNet as a case study to systematically investigate the effect of quantization on inference compute cost-quality tradeoff curves. Our results suggest that for each bfloat16 ResNet model, there are quantized models with lower cost and higher accuracy; in other words, the bfloat16 compute cost-quality tradeoff curve is Pareto-dominated by the 4-bit and 8-bit curves, with models primarily quantized to 4 bits yielding the best Pareto curve. Furthermore, we achieve state-of-the-art results on ImageNet for 4-bit ResNet-50 with quantization-aware training, obtaining a top-1 eval accuracy of 77.09%. We demonstrate the regularizing effect of quantization by measuring the generalization gap. The quantization method we use is optimized for practicality: it requires little tuning and is designed with hardware capabilities in mind. Our work motivates further research into optimal numeric formats for quantization, as well as the development of machine learning accelerators supporting these formats. As part of this work, we contribute a quantization library written in JAX, which is open-sourced at https://github.com/google-research/google-research/tree/master/aqt.
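To make the quantization-aware training setup concrete, below is a minimal JAX sketch of symmetric, per-tensor "fake" quantization with a straight-through estimator, the basic building block of training at 4 or 8 bits. This is an illustration under common assumptions, not the AQT library's API; the function name `fake_quant` and the per-tensor scaling choice are hypothetical simplifications of what a production library would offer.

```python
# Illustrative sketch (not the AQT API): symmetric per-tensor fake quantization
# with a straight-through estimator, as commonly used in quantization-aware training.
import jax
import jax.numpy as jnp


def fake_quant(x: jnp.ndarray, bits: int = 4) -> jnp.ndarray:
    """Round x to signed `bits`-bit integers, then dequantize back to float."""
    max_int = 2.0 ** (bits - 1) - 1.0              # 7 for 4-bit, 127 for 8-bit
    scale = jnp.max(jnp.abs(x)) / max_int + 1e-12  # per-tensor symmetric scale
    q = jnp.clip(jnp.round(x / scale), -max_int, max_int)
    x_q = q * scale
    # Straight-through estimator: the forward pass uses the quantized value,
    # while the backward pass treats rounding as the identity.
    return x + jax.lax.stop_gradient(x_q - x)


# Example: quantize a weight tensor to 4 bits inside a training step.
w = jax.random.normal(jax.random.PRNGKey(0), (64, 64))
w_q = fake_quant(w, bits=4)
```

In this sketch the integer values and scale would map directly onto low-precision matrix-multiply hardware at inference time, which is the hardware-friendly property the paper's cost model relies on.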
