HPPU: An Energy-Efficient Sparse DNN Training Processor with Hybrid Weight Pruning

Motivated by the observation that deep neural networks (DNNs) are typically highly over-parameterized, weight-pruning-based sparse training (ST) has become a practical way to reduce training computation and compress models. However, previous pruning algorithms adopt either a coarse-grained or a fine-grained sparsity pattern: the former limits the achievable pruning ratio and leaves training computation-intensive, while the latter produces a drastically irregular sparsity distribution that requires complex control logic in hardware. Meanwhile, current DNN processors focus on sparse inference and cannot support emerging ST techniques. This paper proposes a co-design approach in which the algorithm is adapted to suit the hardware constraints and the hardware exploits the algorithm's properties to accelerate sparse training. We first present a novel pruning algorithm, hybrid weight pruning, which combines channel-wise and line-wise pruning. It reaches a considerable pruning ratio while remaining hardware-friendly. We then design a hardware architecture, the Hybrid Pruning Processing Unit (HPPU), to accelerate the proposed algorithm. HPPU integrates a 2-level active data selector and a sparse convolution engine, which maximize hardware utilization when handling the hybrid sparsity patterns during training. We evaluate HPPU by synthesizing it in 28nm CMOS technology. HPPU achieves a 50.1% higher pruning ratio than coarse-grained pruning and 1.53× higher energy efficiency than fine-grained pruning. Its peak energy efficiency is 126.04 TFLOPS/W, outperforming the state-of-the-art trainable processor GANPU by 1.67×. When training a ResNet-18 model, HPPU consumes 3.72× less energy, offers a 4.69× speedup, and maintains the accuracy of the unpruned model.
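
The abstract does not spell out the pruning criterion, so the following is a minimal sketch, assuming magnitude (L1-norm) based selection, of how a channel-wise mask and a line-wise mask might be combined in hybrid weight pruning. The function name, the pruning ratios, and the interpretation of a "line" as a kernel row are illustrative assumptions, not HPPU's exact method.

    import numpy as np

    def hybrid_prune(weights, channel_ratio=0.3, line_ratio=0.3):
        # weights: 4-D convolution tensor shaped (out_ch, in_ch, kH, kW).
        # Assumed magnitude-based criterion; ratios are hypothetical.
        out_ch, in_ch, kh, kw = weights.shape
        mask = np.ones_like(weights)

        # Channel-wise (coarse-grained) step: zero the output channels
        # with the smallest L1 norms.
        ch_norms = np.abs(weights).sum(axis=(1, 2, 3))
        n_ch = int(out_ch * channel_ratio)
        pruned_ch = np.argsort(ch_norms)[:n_ch]
        mask[pruned_ch] = 0.0

        # Line-wise (finer-grained) step: among the surviving channels,
        # zero the kernel rows ("lines") with the smallest L1 norms.
        line_norms = np.abs(weights).sum(axis=3)        # (out_ch, in_ch, kH)
        kept = np.setdiff1d(np.arange(out_ch), pruned_ch)
        flat = line_norms[kept].ravel()
        n_line = int(flat.size * line_ratio)
        thresh = np.sort(flat)[n_line] if n_line > 0 else -np.inf
        line_mask = (line_norms >= thresh).astype(weights.dtype)
        mask *= line_mask[..., None]                    # broadcast over kW

        return weights * mask, mask

Because every zeroed element belongs either to a whole channel or to a whole kernel row, the resulting sparsity stays structured; this is presumably the regularity that the 2-level active data selector and sparse convolution engine exploit to skip work without irregular, fine-grained bookkeeping.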

[1] Pierre-Marc Jodoin et al., "Structured Pruning of Neural Networks With Budget-Aware Regularization," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[2] Xin Wang et al., "Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization," ICML, 2019.

[3] Meng-Fan Chang et al., "Sticker: A 0.41-62.1 TOPS/W 8Bit Neural Network Processor with Multi-Sparsity Compatible Convolution Arrays and Online Tuning Acceleration for Fully Connected Layers," IEEE Symposium on VLSI Circuits, 2018.

[4] Xiangyu Zhang et al., "Channel Pruning for Accelerating Very Deep Neural Networks," IEEE International Conference on Computer Vision (ICCV), 2017.

[5] Tao Li et al., "Eager Pruning: Algorithm and Architecture Support for Fast Training of Deep Neural Networks," ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019.

[6] Hoi-Jun Yoo et al., "7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16," IEEE International Solid-State Circuits Conference (ISSCC), 2019.

[7] Hoi-Jun Yoo et al., "A 146.52 TOPS/W Deep-Neural-Network Learning Processor with Stochastic Coarse-Fine Pruning and Adaptive Input/Output/Weight Skipping," IEEE Symposium on VLSI Circuits, 2020.

[8] Hoi-Jun Yoo et al., "7.4 GANPU: A 135TFLOPS/W Multi-DNN Training Processor for GANs with Speculative Dual-Sparsity Exploitation," IEEE International Solid-State Circuits Conference (ISSCC), 2020.