PL-NPU: An Energy-Efficient Edge-Device DNN Training Processor With Posit-Based Logarithm-Domain Computing

Edge-device deep neural network (DNN) training is practical for improving model adaptivity on unfamiliar datasets while avoiding privacy disclosure and large communication costs. Nevertheless, beyond the feed-forward (FF) pass used for inference, DNN training also requires back-propagation (BP) and weight-gradient (WG) computation, which introduce power-hungry floating-point arithmetic, hardware underutilization, and an energy bottleneck from excessive memory access. This paper proposes a DNN training processor named PL-NPU that addresses these challenges with three innovations. First, a posit-based logarithm-domain processing element (PE) adapts to the varying data requirements of training with a low-bit-width format and reduces energy by converting complicated arithmetic into simple logarithm-domain operations. Second, a reconfigurable inter-intra-channel-reuse dataflow dynamically adjusts the PE mapping with a regrouping omega network to improve operand reuse for higher hardware utilization. Third, a pointed-stake-shaped codec unit adaptively compresses small values into a variable-length data format and large values into a fixed-length 8b posit format, reducing the memory access that dominates the training energy bottleneck. Simulated in 28nm CMOS technology, the proposed PL-NPU achieves a maximum frequency of 1040MHz at 343mW within 5.28mm². The peak energy efficiency is 3.87TFLOPS/W at 0.6V and 60MHz. Compared with the state-of-the-art training processor, PL-NPU reaches 3.75× higher energy efficiency and offers a 1.68× speedup when training ResNet18.
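As a rough illustration of the logarithm-domain arithmetic that the PE exploits, the Python sketch below models a multiplication as an addition of fixed-point log2 magnitudes plus an XOR of sign bits. The encoding, bit width, and helper names are illustrative assumptions only; they do not reproduce PL-NPU's actual posit-based datapath.

```python
import math

# Minimal sketch of logarithm-domain multiplication: a multiply becomes an
# addition of log-magnitudes. Illustrative model only; the fixed-point width
# below is an assumption, not PL-NPU's posit encoding or hardware datapath.

LOG_FRAC_BITS = 8  # assumed fractional precision of the fixed-point log value


def to_log_domain(x: float):
    """Encode a nonzero float as (sign bit, fixed-point log2|x|)."""
    sign = 1 if x < 0 else 0
    log_mag = round(math.log2(abs(x)) * (1 << LOG_FRAC_BITS))
    return sign, log_mag


def log_domain_multiply(a, b):
    """Multiply two log-domain operands: XOR the signs, add the logs."""
    (sa, la), (sb, lb) = a, b
    return sa ^ sb, la + lb


def from_log_domain(v) -> float:
    """Decode (sign bit, fixed-point log2 magnitude) back to a float."""
    sign, log_mag = v
    mag = 2.0 ** (log_mag / (1 << LOG_FRAC_BITS))
    return -mag if sign else mag


if __name__ == "__main__":
    x, y = 0.731, -3.25
    approx = from_log_domain(log_domain_multiply(to_log_domain(x), to_log_domain(y)))
    print(f"exact  = {x * y:.6f}")
    print(f"approx = {approx:.6f}")  # small error from the quantized log value
```

The sketch shows why the log domain is attractive in hardware: the costly mantissa multiplier is replaced by an integer adder, at the price of a quantization error controlled by the fractional bit width.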
