Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision

The success of DNNs comes at the cost of excessive memory and computation, which can be addressed by jointly exploiting reduced precision and sparsity. Existing sparse GPU kernels, however, fail to achieve practical speedup over half-precision cuBLAS GEMM (cublasHgemm). Kernels for fine-grained sparsity suffer from low data reuse, while those for coarse-grained sparsity are limited by the trade-off between kernel performance and model quality across grain sizes. We propose column-vector-sparse-encoding, which attains a smaller grain size than block sparsity at the same data-reuse rate and applies to both SpMM and SDDMM, the two major sparse operations in DNNs. We also introduce Tensor-Core-based 1D Octet Tiling, which provides efficient memory-access and computation patterns at small grain sizes. Building on these, we design SpMM and SDDMM kernels that achieve 1.71-7.19x speedup over cuSPARSE. Practical speedup over cublasHgemm is achieved at >70% and >90% sparsity with a 4x1 grain size under half precision.
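
To make the encoding and the two operations concrete, the sketch below mimics the semantics of column-vector-sparse-encoding, SpMM, and SDDMM with NumPy on the CPU. The 4x1 grain size matches the abstract, but the triplet layout (`values`, `row_block`, `col`) and all function names are illustrative assumptions for this reference model, not the authors' Tensor Core kernels.

```python
# Minimal reference sketch (not the paper's GPU implementation):
# column-vector-sparse encoding with 4x1 vectors, plus SpMM and SDDMM
# defined against dense NumPy baselines for checking.
import numpy as np

V = 4  # assumed grain size: each retained nonzero block is a 4x1 column vector

def encode_column_vectors(A, v=V):
    """Keep every 4x1 column vector of A that contains at least one nonzero.

    Returns (values, row_block, col): `values[i]` holds the v entries of the
    i-th retained vector, `row_block[i]` its vertical block index, `col[i]`
    its column index.
    """
    m, k = A.shape
    assert m % v == 0
    values, row_block, col = [], [], []
    for rb in range(m // v):
        for c in range(k):
            vec = A[rb * v:(rb + 1) * v, c]
            if np.any(vec != 0):      # the vector is kept or dropped as a whole
                values.append(vec.copy())
                row_block.append(rb)
                col.append(c)
    return np.array(values), np.array(row_block), np.array(col)

def spmm(values, row_block, col, B, m, v=V):
    """SpMM: C = A @ B, with A given in column-vector-sparse encoding."""
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for vec, rb, c in zip(values, row_block, col):
        # All v rows of one encoded vector share column index c, so the single
        # fetched row B[c, :] is reused v times (the data-reuse benefit).
        C[rb * v:(rb + 1) * v, :] += np.outer(vec, B[c, :])
    return C

def sddmm(row_block, col, X, Y, v=V):
    """SDDMM: evaluate X @ Y only at the positions kept by the encoding."""
    values = np.empty((len(col), v), dtype=X.dtype)
    for i, (rb, c) in enumerate(zip(row_block, col)):
        values[i] = X[rb * v:(rb + 1) * v, :] @ Y[:, c]
    return values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((16, 8)) * (rng.random((16, 8)) > 0.8)  # sparse A
    B = rng.standard_normal((8, 6))
    vals, rb, c = encode_column_vectors(A)
    assert np.allclose(spmm(vals, rb, c, B, A.shape[0]), A @ B)

    X, Y = rng.standard_normal((16, 5)), rng.standard_normal((5, 8))
    ref, out = X @ Y, sddmm(rb, c, X, Y)
    for i, (b, cc) in enumerate(zip(rb, c)):
        assert np.allclose(out[i], ref[b * V:(b + 1) * V, cc])
```

The sketch also shows why the column-vector grain helps: compared with unstructured sparsity, each kept vector amortizes one operand fetch across v multiply-accumulates, while the grain stays much smaller than a full 2D block.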
