HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU

Although Transformer-based deep learning models have been widely used in many natural language processing (NLP) tasks as well as in computer vision, they suffer from enormous model sizes and long inference latency. Network pruning can reduce both the computational cost and the model size. However, existing works mainly focus on irregular (sparse) pruning, which often leads to irregular computation patterns and requires an extra index per remaining weight. In this work, we propose a Tensor-core inspired hierarchical model compression method to push the performance limit on modern GPUs. We present two modes of this two-step process. In the first mode, we apply a Tensor-core aware block-based weight pruning method to exploit model sparsity in a coarse-grained manner, and then use low-rank decomposition [33] to further reduce the weight storage in a fine-grained manner. In the second mode, we first use irregular pruning to obtain a highly sparse model, and then apply the Tensor-core aware weight constraint on the sparse model to decompose the sparse matrix into several smaller but Tensor-core friendly sub-matrices. Experiments on Transformer and BERT-Base models show that the proposed method outperforms the state-of-the-art.
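As a loose illustration of the first mode, the NumPy sketch below applies coarse-grained block pruning to a single weight matrix and then factorizes the pruned matrix with a truncated SVD. The block size, keep ratio, and rank are hypothetical placeholders, and the sketch omits the paper's actual Tensor-core-aware constraints and retraining; it only shows the general prune-then-decompose flow.

```python
# Minimal sketch of "block pruning then low-rank decomposition" (mode 1).
# Block size, keep ratio, and rank below are illustrative, not the
# settings used in the paper.
import numpy as np

def block_prune(weight, block=16, keep_ratio=0.5):
    """Zero out the block x block tiles with the smallest L2 norms,
    keeping whole tiles so the result stays GPU/Tensor-core friendly."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    tiles = weight.reshape(rows // block, block, cols // block, block)
    norms = np.linalg.norm(tiles, axis=(1, 3))          # per-tile Frobenius norm
    k = int(np.ceil(keep_ratio * norms.size))
    threshold = np.sort(norms, axis=None)[-k]           # keep the top-k tiles
    mask = (norms >= threshold).astype(weight.dtype)
    return (tiles * mask[:, None, :, None]).reshape(rows, cols)

def low_rank(weight, rank=64):
    """Truncated SVD: W is approximated by U @ V with U (m x r), V (r x n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768)).astype(np.float32)  # e.g. one BERT-Base weight matrix
W_pruned = block_prune(W, block=16, keep_ratio=0.5)
U, V = low_rank(W_pruned, rank=64)
print("parameter ratio after factorization:", (U.size + V.size) / W.size)
```

In practice the two steps interact: the pruning granularity is chosen to match the Tensor-core tile shape so that both the sparse and the factorized operands map onto dense matrix-multiply units.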
