Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks

The large computing and memory cost of deep neural networks (DNNs) often precludes their use in resource-constrained devices. Quantizing parameters and operations to lower bit precision offers substantial memory and energy savings for neural network inference, facilitating the use of DNNs on edge computing platforms. Recent efforts to quantize DNNs have employed a range of techniques encompassing progressive quantization, step-size adaptation, and gradient scaling. This paper proposes a new quantization approach for mixed-precision convolutional neural networks (CNNs) targeting edge computing. Our method establishes a new Pareto frontier in model accuracy and memory footprint, demonstrating a range of quantized models that deliver best-in-class accuracy below 4.3 MB of weights (wgts.) and activations (acts.). Our main contributions are: (i) hardware-aware, heterogeneous, differentiable quantization with tensor-sliced learned precision; (ii) targeted gradient modification for weights and activations to mitigate quantization errors; and (iii) a multi-phase learning schedule to address instability arising from the interaction of updates to the learned quantizer parameters and the model parameters. We demonstrate the effectiveness of our techniques on the ImageNet dataset across a range of models, including EfficientNet-Lite0 (e.g., 4.14 MB of wgts. and acts. at 67.66% accuracy) and MobileNetV2 (e.g., 3.51 MB of wgts. and acts. at 65.39% accuracy).
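To make the core idea of differentiable quantization concrete, the sketch below shows a uniform fake quantizer with a learned step size and a straight-through estimator for the rounding operation, written in JAX. This is a minimal illustration in the spirit of learned step-size quantization, not the paper's implementation; the names `round_ste`, `fake_quantize`, and `step_size` are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): uniform fake quantization
# with a learnable step size and a straight-through estimator (STE).

import jax
import jax.numpy as jnp


def round_ste(x):
    """Round to the nearest integer, passing gradients straight through."""
    # stop_gradient hides the zero-gradient round(); the identity term
    # carries the incoming gradient unchanged (straight-through estimator).
    return x + jax.lax.stop_gradient(jnp.round(x) - x)


def fake_quantize(x, step_size, num_bits=4, signed=True):
    """Uniform fake quantization with a learnable step size (scale).

    Both `x` and `step_size` receive gradients, so the clipping range can
    be optimized jointly with the model weights during training.
    """
    if signed:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    # Scale, clip to the representable range, round with STE, rescale.
    q = jnp.clip(x / step_size, qmin, qmax)
    return round_ste(q) * step_size


# Example: gradients flow to both the tensor and the learned step size.
x = jnp.linspace(-1.0, 1.0, 8)
loss = lambda s: jnp.sum((fake_quantize(x, s) - x) ** 2)
print(jax.grad(loss)(jnp.asarray(0.1)))
```

Extending this sketch to tensor-sliced learned precision would amount to giving each slice of a weight or activation tensor its own `step_size` (and bit width) parameter, which is one plausible reading of contribution (i).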
