BitX: Empower Versatile Inference with Hardware Runtime Pruning

Classic DNN pruning mostly relies on software-based methodologies to tackle the accuracy/speed tradeoff, involving complicated procedures such as critical-parameter search, fine-tuning, and sparse training to find the best pruning plan. In this paper, we explore the opportunities of hardware runtime pruning and propose a hardware runtime pruning methodology, termed "BitX", to empower versatile DNN inference. It targets the abundant useless bits in the parameters, pinpointing and pruning these bits on the fly within the proposed BitX accelerator. The versatility of BitX lies in that it (1) requires no software effort; (2) is orthogonal to software-based pruning; and (3) supports multiple precisions (both floating point and fixed point). Empirical studies on image classification and object detection models highlight the following results: (1) up to 4.82x speedup over the original non-pruned DNNs and 14.76x speedup when combined with software-pruned DNNs; (2) up to 0.07% and 0.9% higher accuracy for floating-point and fixed-point DNNs, respectively; (3) 2.00x and 3.79x performance improvement over state-of-the-art accelerators, with a 0.039 mm2 area and power consumption of 68.62 mW (32-bit floating point) and 36.41 mW (16-bit fixed point) under the TSMC 28 nm technology library.
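
To make the bit-pruning idea concrete, below is a minimal Python sketch of runtime bit pruning on a single fixed-point weight: it keeps only the k most significant '1' bits and drops the rest, so a multiplication degenerates into at most k shift-and-add operations. The function names, the choice of k, and the fixed-point framing are illustrative assumptions, not the paper's exact algorithm; the actual BitX design performs this selection on the fly in hardware.

def prune_bits(weight: int, k: int = 2) -> int:
    """Keep only the k highest-order '1' bits of a non-negative fixed-point weight."""
    kept, remaining = 0, weight
    for _ in range(k):
        if remaining == 0:
            break
        top = 1 << (remaining.bit_length() - 1)   # most significant remaining '1' bit
        kept |= top
        remaining &= ~top
    return kept

def prune_and_multiply(activation: int, weight: int, k: int = 2) -> int:
    """Multiply using only the surviving bits: at most k shift-and-add operations."""
    pruned = prune_bits(weight, k)
    return sum(activation << b for b in range(pruned.bit_length()) if (pruned >> b) & 1)

# Example: weight 0b0110_1011 (107) pruned to its top-2 bits becomes 0b0110_0000 (96),
# so 5 * 107 is approximated by 5 * 96 = 480 using only two shift-adds.
assert prune_bits(0b01101011, k=2) == 0b01100000
assert prune_and_multiply(5, 0b01101011, k=2) == 5 * 96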
