PL-NPU: An Energy-Efficient Edge-Device DNN Training Processor With Posit-Based Logarithm-Domain Computing

Edge-device deep neural network (DNN) training is practical for improving model adaptivity on unfamiliar datasets while avoiding privacy disclosure and large communication costs. Nevertheless, beyond the feed-forward (FF) pass used for inference, DNN training also requires back-propagation (BP) and weight-gradient (WG) computation, which introduce power-hungry floating-point arithmetic, hardware underutilization, and an energy bottleneck from excessive memory access. This paper proposes a DNN training processor named PL-NPU that addresses these challenges with three innovations. First, a posit-based logarithm-domain processing element (PE) adapts to the varying data requirements of training with a low-bit-width format and reduces energy by converting complicated arithmetic into simple logarithm-domain operations. Second, a reconfigurable inter-intra-channel-reuse dataflow dynamically adjusts the PE mapping with a regrouping omega network to improve operand reuse for higher hardware utilization. Third, a pointed-stake-shaped codec unit adaptively compresses small values into a variable-length data format and large values into a fixed-length 8b posit format, reducing the memory access that dominates the training energy bottleneck. Simulated in 28nm CMOS technology, the proposed PL-NPU achieves a maximum frequency of 1040MHz at 343mW within 5.28mm². The peak energy efficiency is 3.87TFLOPS/W at 0.6V and 60MHz. Compared with the state-of-the-art training processor, PL-NPU reaches 3.75× higher energy efficiency and offers a 1.68× speedup when training ResNet18.
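As a rough illustration of the logarithm-domain arithmetic that the PE exploits, the Python sketch below models a multiplication as an addition of fixed-point log2 magnitudes plus an XOR of sign bits. The encoding, bit width, and helper names are illustrative assumptions only; they do not reproduce PL-NPU's actual posit-based datapath.

```python
import math

# Minimal sketch of logarithm-domain multiplication: a multiply becomes an
# addition of log-magnitudes. Illustrative model only; the fixed-point width
# below is an assumption, not PL-NPU's posit encoding or hardware datapath.

LOG_FRAC_BITS = 8  # assumed fractional precision of the fixed-point log value


def to_log_domain(x: float):
    """Encode a nonzero float as (sign bit, fixed-point log2|x|)."""
    sign = 1 if x < 0 else 0
    log_mag = round(math.log2(abs(x)) * (1 << LOG_FRAC_BITS))
    return sign, log_mag


def log_domain_multiply(a, b):
    """Multiply two log-domain operands: XOR the signs, add the logs."""
    (sa, la), (sb, lb) = a, b
    return sa ^ sb, la + lb


def from_log_domain(v) -> float:
    """Decode (sign bit, fixed-point log2 magnitude) back to a float."""
    sign, log_mag = v
    mag = 2.0 ** (log_mag / (1 << LOG_FRAC_BITS))
    return -mag if sign else mag


if __name__ == "__main__":
    x, y = 0.731, -3.25
    approx = from_log_domain(log_domain_multiply(to_log_domain(x), to_log_domain(y)))
    print(f"exact  = {x * y:.6f}")
    print(f"approx = {approx:.6f}")  # small error from the quantized log value
```

The sketch shows why the log domain is attractive in hardware: the costly mantissa multiplier is replaced by an integer adder, at the price of a quantization error controlled by the fractional bit width.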
