Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision

The success of DNNs comes at the cost of excessive memory and computation, which can be addressed by jointly exploiting reduced precision and sparsity. Existing sparse GPU kernels, however, fail to achieve practical speedup over half-precision cuBLAS GEMM (cublasHgemm). Kernels for fine-grained sparsity suffer from low data reuse, while those for coarse-grained sparsity are limited by the trade-off between kernel performance and model quality across grain sizes. We propose column-vector-sparse-encoding, which attains a smaller grain size than block sparsity at the same data-reuse rate and applies to both SpMM and SDDMM, the two major sparse operations in DNNs. We also introduce Tensor-Core-based 1D Octet Tiling, which provides efficient memory-access and computation patterns at small grain sizes. Building on these, we design SpMM and SDDMM kernels that achieve 1.71-7.19x speedup over cuSPARSE. Practical speedup over cublasHgemm is achieved at >70% and >90% sparsity with a 4x1 grain size under half precision.
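
To make the encoding and the two operations concrete, the sketch below mimics the semantics of column-vector-sparse-encoding, SpMM, and SDDMM with NumPy on the CPU. The 4x1 grain size matches the abstract, but the triplet layout (`values`, `row_block`, `col`) and all function names are illustrative assumptions for this reference model, not the authors' Tensor Core kernels.

```python
# Minimal reference sketch (not the paper's GPU implementation):
# column-vector-sparse encoding with 4x1 vectors, plus SpMM and SDDMM
# defined against dense NumPy baselines for checking.
import numpy as np

V = 4  # assumed grain size: each retained nonzero block is a 4x1 column vector

def encode_column_vectors(A, v=V):
    """Keep every 4x1 column vector of A that contains at least one nonzero.

    Returns (values, row_block, col): `values[i]` holds the v entries of the
    i-th retained vector, `row_block[i]` its vertical block index, `col[i]`
    its column index.
    """
    m, k = A.shape
    assert m % v == 0
    values, row_block, col = [], [], []
    for rb in range(m // v):
        for c in range(k):
            vec = A[rb * v:(rb + 1) * v, c]
            if np.any(vec != 0):      # the vector is kept or dropped as a whole
                values.append(vec.copy())
                row_block.append(rb)
                col.append(c)
    return np.array(values), np.array(row_block), np.array(col)

def spmm(values, row_block, col, B, m, v=V):
    """SpMM: C = A @ B, with A given in column-vector-sparse encoding."""
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for vec, rb, c in zip(values, row_block, col):
        # All v rows of one encoded vector share column index c, so the single
        # fetched row B[c, :] is reused v times (the data-reuse benefit).
        C[rb * v:(rb + 1) * v, :] += np.outer(vec, B[c, :])
    return C

def sddmm(row_block, col, X, Y, v=V):
    """SDDMM: evaluate X @ Y only at the positions kept by the encoding."""
    values = np.empty((len(col), v), dtype=X.dtype)
    for i, (rb, c) in enumerate(zip(row_block, col)):
        values[i] = X[rb * v:(rb + 1) * v, :] @ Y[:, c]
    return values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((16, 8)) * (rng.random((16, 8)) > 0.8)  # sparse A
    B = rng.standard_normal((8, 6))
    vals, rb, c = encode_column_vectors(A)
    assert np.allclose(spmm(vals, rb, c, B, A.shape[0]), A @ B)

    X, Y = rng.standard_normal((16, 5)), rng.standard_normal((5, 8))
    ref, out = X @ Y, sddmm(rb, c, X, Y)
    for i, (b, cc) in enumerate(zip(rb, c)):
        assert np.allclose(out[i], ref[b * V:(b + 1) * V, cc])
```

The sketch also shows why the column-vector grain helps: compared with unstructured sparsity, each kept vector amortizes one operand fetch across v multiply-accumulates, while the grain stays much smaller than a full 2D block.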
