SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor With Dynamic Sub-Structured Weight Pruning

When deploying deep neural networks (DNNs), on-device training is practical for improving model adaptivity to user-specific scenarios while avoiding privacy disclosure. However, the training computation is prohibitive for edge devices. This has brought sparse DNN training (SDT) into the limelight, which reduces training computation through dynamic weight pruning. SDT generally follows one of two strategies, depending on the pruning granularity: structured or unstructured pruning. Unfortunately, both suffer from limited training efficiency due to the gap between pruning granularity and hardware implementation. The former is hardware-friendly but has a low pruning ratio, implying limited computation reduction. The latter achieves a high pruning ratio, but its unbalanced workload decreases utilization and its irregular sparsity distribution incurs considerable sparsity-processing overhead. This paper proposes a software-hardware co-design that bridges this gap to improve the efficiency of SDT. On the algorithm side, a sub-structured pruning method, realized with hybrid shape-wise and line-wise pruning, generates a high sparsity ratio while remaining hardware-friendly. On the hardware side, a sub-structured weight processing unit (SWPU) efficiently handles the hybrid sparsity with three techniques. First, SWPU dynamically reorders the computation sequence with Hamming-distance-based clustering, balancing the irregular workload. Second, SWPU performs runtime scheduling by exploiting the features of sub-structured sparse convolution through a detect-before-load controller, which skips redundant memory accesses and sparsity processing. Third, SWPU performs sparse convolution by compressing operands with spatial-disconnect log-based routing and recovering their locations with bi-directional switching, avoiding power-hungry routing logic. Synthesized in 28-nm CMOS technology, SWPU operates across a 0.56-V-to-1.0-V supply range with a maximum frequency of 675 MHz. It achieves a 50.1% higher pruning ratio than structured pruning and 1.53× higher energy efficiency than unstructured pruning. The peak energy efficiency of SWPU is 126.04 TFLOPS/W, outperforming the state-of-the-art training processor by 1.67×. When training a ResNet-18 model, SWPU reduces energy by 3.72× and delivers a 4.69× speedup over previous sparse training processors.
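As a rough illustration of the algorithm-side idea, the sketch below applies hybrid shape-wise and line-wise magnitude pruning to a convolution weight tensor in PyTorch. The grouping of weights into "shapes" (one spatial position shared across output channels) and "lines" (single filter rows), the L1 scoring, and the sparsity targets are assumptions made for illustration, not the paper's exact formulation.

```python
import torch

def sub_structured_prune(weight: torch.Tensor,
                         shape_sparsity: float = 0.3,
                         line_sparsity: float = 0.3) -> torch.Tensor:
    """Hedged sketch of hybrid shape-wise + line-wise magnitude pruning.

    weight: conv weight of shape (C_out, C_in, K_h, K_w).
    Here a 'shape' group is one (c_in, k_h, k_w) position shared by all
    output channels, and a 'line' group is one (c_out, c_in, k_h) filter
    row; both granularities are illustrative assumptions.
    """
    c_out, c_in, k_h, k_w = weight.shape
    mask = torch.ones_like(weight)

    # Shape-wise: score each (c_in, k_h, k_w) position by its L1 norm
    # across output channels and prune the lowest-scoring fraction.
    shape_scores = weight.abs().sum(dim=0).reshape(-1)           # (C_in*K_h*K_w,)
    n_shape = int(shape_sparsity * shape_scores.numel())
    if n_shape > 0:
        drop = torch.topk(shape_scores, n_shape, largest=False).indices
        shape_mask = torch.ones_like(shape_scores)
        shape_mask[drop] = 0.0
        mask *= shape_mask.reshape(1, c_in, k_h, k_w)

    # Line-wise: among the surviving weights, score each filter row
    # (c_out, c_in, k_h) by its L1 norm and prune the weakest rows.
    line_scores = (weight * mask).abs().sum(dim=-1).reshape(-1)  # (C_out*C_in*K_h,)
    n_line = int(line_sparsity * line_scores.numel())
    if n_line > 0:
        drop = torch.topk(line_scores, n_line, largest=False).indices
        line_mask = torch.ones_like(line_scores)
        line_mask[drop] = 0.0
        mask *= line_mask.reshape(c_out, c_in, k_h, 1)

    return weight * mask

# Usage: prune a ResNet-style 3x3 convolution layer's weights during training.
w = torch.randn(64, 64, 3, 3)
w_sparse = sub_structured_prune(w, shape_sparsity=0.3, line_sparsity=0.3)
```

In this sketch, pruning whole shape and line groups keeps the remaining nonzeros regular enough for hardware indexing while still allowing a higher overall sparsity than purely structured (filter- or channel-level) pruning.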
