Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks

The large computing and memory cost of deep neural networks (DNNs) often precludes their use in resource-constrained devices. Quantizing parameters and operations to lower bit precision offers substantial memory and energy savings for neural network inference, facilitating the use of DNNs on edge computing platforms. Recent efforts to quantize DNNs have employed a range of techniques encompassing progressive quantization, step-size adaptation, and gradient scaling. This paper proposes a new quantization approach for mixed-precision convolutional neural networks (CNNs) targeting edge computing. Our method establishes a new Pareto frontier in model accuracy and memory footprint, demonstrating a range of quantized models that deliver best-in-class accuracy below 4.3 MB of weights (wgts.) and activations (acts.). Our main contributions are: (i) hardware-aware, heterogeneous, differentiable quantization with tensor-sliced learned precision; (ii) targeted gradient modification for weights and activations to mitigate quantization errors; and (iii) a multi-phase learning schedule to address instability arising from the interaction of updates to the learned quantizer parameters and the model parameters. We demonstrate the effectiveness of our techniques on the ImageNet dataset across a range of models, including EfficientNet-Lite0 (e.g., 4.14 MB of wgts. and acts. at 67.66% accuracy) and MobileNetV2 (e.g., 3.51 MB of wgts. and acts. at 65.39% accuracy).
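To make the core idea of differentiable quantization concrete, the sketch below shows a uniform fake quantizer with a learned step size and a straight-through estimator for the rounding operation, written in JAX. This is a minimal illustration in the spirit of learned step-size quantization, not the paper's implementation; the names `round_ste`, `fake_quantize`, and `step_size` are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): uniform fake quantization
# with a learnable step size and a straight-through estimator (STE).

import jax
import jax.numpy as jnp


def round_ste(x):
    """Round to the nearest integer, passing gradients straight through."""
    # stop_gradient hides the zero-gradient round(); the identity term
    # carries the incoming gradient unchanged (straight-through estimator).
    return x + jax.lax.stop_gradient(jnp.round(x) - x)


def fake_quantize(x, step_size, num_bits=4, signed=True):
    """Uniform fake quantization with a learnable step size (scale).

    Both `x` and `step_size` receive gradients, so the clipping range can
    be optimized jointly with the model weights during training.
    """
    if signed:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    # Scale, clip to the representable range, round with STE, rescale.
    q = jnp.clip(x / step_size, qmin, qmax)
    return round_ste(q) * step_size


# Example: gradients flow to both the tensor and the learned step size.
x = jnp.linspace(-1.0, 1.0, 8)
loss = lambda s: jnp.sum((fake_quantize(x, s) - x) ** 2)
print(jax.grad(loss)(jnp.asarray(0.1)))
```

Extending this sketch to tensor-sliced learned precision would amount to giving each slice of a weight or activation tensor its own `step_size` (and bit width) parameter, which is one plausible reading of contribution (i).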
