Spartan: A Sparsity-Adaptive Framework to Accelerate Deep Neural Network Training on GPUs

Deep Neural Networks (DNNs) have emerged as an important class of machine learning algorithms, providing accurate solutions to a broad range of applications. Sparsity in activation maps during DNN training offers an opportunity to reduce computation. However, exploiting activation sparsity poses two major challenges: i) profiling activation sparsity during training incurs significant overhead, both to compute the degree of sparsity and to move the data; ii) because activation maps change dynamically, dense-to-sparse conversion must be performed on the fly during training, adding further overhead. In this article, we present Spartan, a lightweight hardware/software framework to accelerate DNN training on a GPU. Spartan provides a cost-effective and programmer-transparent microarchitectural solution to exploit the activation sparsity detected during training. It combines an efficient sparsity monitor, a tile-based sparse GEMM algorithm, and a novel compaction engine designed for GPU workloads. Spartan reduces sparsity profiling overhead by 52.5× on average. For the most compute-intensive layers, i.e., convolutional layers, Spartan speeds up training by 3.4× for AlexNet, 2.14× for VGGNet-16, and 2.02× for ResNet-18 on the ImageNet dataset.
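To make the tile-based idea concrete, the following is a minimal NumPy sketch, not Spartan's actual implementation: it profiles the zero fraction of each activation tile and performs a blocked GEMM that skips tiles found to be entirely zero. The tile size, the function names, and the use of NumPy on the CPU are illustrative assumptions; Spartan itself operates at the GPU microarchitecture level.

```python
# Illustrative sketch (not Spartan's implementation): tile-level sparsity
# profiling and a blocked GEMM that skips all-zero activation tiles.
# Tile size and all names below are assumptions for illustration only.
import numpy as np

TILE = 32  # assumed tile edge length


def tile_sparsity(act: np.ndarray) -> np.ndarray:
    """Fraction of zeros in each TILE x TILE block of a 2-D activation map."""
    m, k = act.shape
    rows, cols = m // TILE, k // TILE
    ratios = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = act[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE]
            ratios[i, j] = np.mean(block == 0)
    return ratios


def tiled_sparse_gemm(act: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Blocked GEMM that skips activation tiles that are entirely zero."""
    m, k = act.shape
    k2, n = weight.shape
    assert k == k2
    out = np.zeros((m, n))
    zero_tile = tile_sparsity(act) == 1.0  # tiles that are 100% zeros
    for i in range(m // TILE):
        for j in range(k // TILE):
            if zero_tile[i, j]:
                continue  # an all-zero tile contributes nothing to the output
            a_blk = act[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE]
            w_blk = weight[j * TILE:(j + 1) * TILE, :]
            out[i * TILE:(i + 1) * TILE, :] += a_blk @ w_blk
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = np.maximum(rng.standard_normal((128, 128)), 0)  # ReLU-like sparsity
    a[:32, :] = 0                                        # force some zero tiles
    w = rng.standard_normal((128, 64))
    print("max abs error:", np.abs(a @ w - tiled_sparse_gemm(a, w)).max())
```

In this sketch, the profiling pass and the skip decision are both done in software on dense data; the point of Spartan's hardware sparsity monitor and compaction engine is precisely to avoid paying that profiling and dense-to-sparse conversion cost on the critical path of training.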
