Spartan: A Sparsity-Adaptive Framework to Accelerate Deep Neural Network Training on GPUs

Deep Neural Networks (DNNs) have emerged as an important class of machine learning algorithms, providing accurate solutions to a broad range of applications. Sparsity in activation maps during DNN training offers an opportunity to reduce computation. However, exploiting activation sparsity poses two major challenges: i) profiling activation sparsity during training incurs significant overhead, both to compute the degree of sparsity and to move the data; ii) because activation maps change dynamically, dense-to-sparse conversion must be performed on the fly during training, adding further overhead. In this article, we present Spartan, a lightweight hardware/software framework to accelerate DNN training on a GPU. Spartan provides a cost-effective and programmer-transparent microarchitectural solution to exploit the activation sparsity detected during training. It combines an efficient sparsity monitor, a tile-based sparse GEMM algorithm, and a novel compaction engine designed for GPU workloads. Spartan reduces sparsity profiling overhead by 52.5× on average. For the most compute-intensive layers, i.e., convolutional layers, Spartan speeds up training by 3.4× for AlexNet, 2.14× for VGGNet-16, and 2.02× for ResNet-18 on the ImageNet dataset.
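To make the tile-based idea concrete, the following is a minimal NumPy sketch, not Spartan's actual implementation: it profiles the zero fraction of each activation tile and performs a blocked GEMM that skips tiles found to be entirely zero. The tile size, the function names, and the use of NumPy on the CPU are illustrative assumptions; Spartan itself operates at the GPU microarchitecture level.

```python
# Illustrative sketch (not Spartan's implementation): tile-level sparsity
# profiling and a blocked GEMM that skips all-zero activation tiles.
# Tile size and all names below are assumptions for illustration only.
import numpy as np

TILE = 32  # assumed tile edge length


def tile_sparsity(act: np.ndarray) -> np.ndarray:
    """Fraction of zeros in each TILE x TILE block of a 2-D activation map."""
    m, k = act.shape
    rows, cols = m // TILE, k // TILE
    ratios = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = act[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE]
            ratios[i, j] = np.mean(block == 0)
    return ratios


def tiled_sparse_gemm(act: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Blocked GEMM that skips activation tiles that are entirely zero."""
    m, k = act.shape
    k2, n = weight.shape
    assert k == k2
    out = np.zeros((m, n))
    zero_tile = tile_sparsity(act) == 1.0  # tiles that are 100% zeros
    for i in range(m // TILE):
        for j in range(k // TILE):
            if zero_tile[i, j]:
                continue  # an all-zero tile contributes nothing to the output
            a_blk = act[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE]
            w_blk = weight[j * TILE:(j + 1) * TILE, :]
            out[i * TILE:(i + 1) * TILE, :] += a_blk @ w_blk
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = np.maximum(rng.standard_normal((128, 128)), 0)  # ReLU-like sparsity
    a[:32, :] = 0                                        # force some zero tiles
    w = rng.standard_normal((128, 64))
    print("max abs error:", np.abs(a @ w - tiled_sparse_gemm(a, w)).max())
```

In this sketch, the profiling pass and the skip decision are both done in software on dense data; the point of Spartan's hardware sparsity monitor and compaction engine is precisely to avoid paying that profiling and dense-to-sparse conversion cost on the critical path of training.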
