Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors

While the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference, i.e. reduced model size, latency and energy consumption. One technique to limit model size is quantization, which uses fewer bits to represent weights and biases; such an approach usually comes at the cost of accuracy. Here, we introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter-type automatic quantization procedure that samples from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of ${\mathcal O}(1)~\mu$s is required. When implemented on FPGA hardware, the method achieves nanosecond inference and reduces resource consumption by a factor of $50$.
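The search idea described above can be illustrated with a minimal, framework-agnostic sketch: each layer's weights and biases are assigned their own bit width, and candidate configurations are scored by trading task accuracy against a bit-count proxy for size and energy. All names here (quantize, bit_cost, the toy two-layer model, the exhaustive palette search) are hypothetical illustrations, not the authors' implementation, which uses a far richer quantizer space and hardware-aware cost estimates.

```python
# Sketch of per-layer, per-parameter-type heterogeneous quantization search.
# Plain NumPy; all names and the toy model are illustrative assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Uniform symmetric fake-quantization of an array to the given bit width."""
    if bits >= 32:                      # treat as full precision
        return x
    scale = np.max(np.abs(x)) or 1.0
    levels = 2 ** (bits - 1) - 1        # signed representation
    return np.round(x / scale * levels) / levels * scale

def bit_cost(layers, config):
    """Crude proxy for model size/energy: total number of weight and bias bits."""
    return sum(w.size * wb + b.size * bb
               for (w, b), (wb, bb) in zip(layers, config))

def accuracy(layers, config, X, y):
    """Accuracy of a quantized two-layer ReLU network on (X, y)."""
    h = X
    for i, ((w, b), (wb, bb)) in enumerate(zip(layers, config)):
        h = h @ quantize(w, wb) + quantize(b, bb)
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)      # ReLU on hidden layers only
    return np.mean(np.argmax(h, axis=1) == y)

# Toy "pretrained" model and data, standing in for a real trained network.
layers = [(rng.normal(size=(16, 32)), rng.normal(size=32)),
          (rng.normal(size=(32, 4)),  rng.normal(size=4))]
X, y = rng.normal(size=(256, 16)), rng.integers(0, 4, size=256)

# Exhaustive search over a small palette of bit widths; a realistic search would
# use Bayesian optimization or bandits over a much larger quantizer space.
palette = [2, 4, 8]                     # candidate bit widths per parameter type
best = None
for config in itertools.product(itertools.product(palette, palette),
                                repeat=len(layers)):
    acc, cost = accuracy(layers, config, X, y), bit_cost(layers, config)
    score = acc - 1e-5 * cost           # trade accuracy against size/energy proxy
    if best is None or score > best[0]:
        best = (score, config, acc, cost)

print("best per-layer (weight_bits, bias_bits):", best[1],
      "accuracy:", round(best[2], 3), "total bits:", best[3])
```

The weighting between accuracy and the cost proxy is the key design choice: shifting it towards cost drives the search to aggressively low bit widths (down to ternary or binary parameters), while shifting it towards accuracy recovers a nearly full-precision model.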
