PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model while maintaining accuracy. It works by: (a) exploiting redundancy among word-vectors (the intermediate encoder outputs) and eliminating the redundant ones; (b) determining which word-vectors to eliminate via a significance measure derived from the self-attention mechanism; and (c) learning how many word-vectors to eliminate by augmenting the BERT model and the loss function. Experiments on the standard GLUE benchmark show that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT with <1% loss in accuracy. We show that PoWER-BERT offers a significantly better trade-off between accuracy and inference time than prior methods. We also demonstrate that our method attains up to 6.8x reduction in inference time with <1% loss in accuracy when applied to ALBERT, a highly compressed variant of BERT. The code for PoWER-BERT is publicly available at this https URL.

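To make step (b) of the abstract concrete, the sketch below scores word-vectors by the total self-attention they receive (summed over heads and query positions) and retains only the top-scoring ones. This is a minimal illustration, not the paper's implementation: the function names, the exact aggregation, and the fixed `keep` count per layer are assumptions introduced here for clarity, whereas PoWER-BERT learns the retention configuration during training as described in step (c).

```python
import torch


def significance_scores(attention_probs: torch.Tensor) -> torch.Tensor:
    """Score each word-vector by the total attention it receives.

    attention_probs: softmax-normalised attention of one encoder layer,
        shape (batch, num_heads, seq_len, seq_len), where entry
        [b, h, i, j] is the attention paid by query position i to key position j.
    Returns a (batch, seq_len) tensor of significance scores, one per word-vector.
    """
    # Sum over heads (dim 1), then over query positions (the remaining dim 1),
    # leaving one aggregate score per key position, i.e. per word-vector.
    return attention_probs.sum(dim=1).sum(dim=1)


def eliminate_word_vectors(hidden_states: torch.Tensor,
                           attention_probs: torch.Tensor,
                           keep: int) -> torch.Tensor:
    """Keep only the `keep` most significant word-vectors in each example.

    hidden_states: encoder layer output, shape (batch, seq_len, hidden_dim).
    Returns a reduced tensor of shape (batch, keep, hidden_dim).
    """
    scores = significance_scores(attention_probs)                 # (batch, seq_len)
    top_idx = scores.topk(keep, dim=1).indices                    # (batch, keep)
    top_idx = top_idx.sort(dim=1).values                          # preserve word order
    batch_idx = torch.arange(hidden_states.size(0)).unsqueeze(1)  # (batch, 1)
    return hidden_states[batch_idx, top_idx]
```

In this sketch the reduced output would simply be passed to the next encoder layer, so each successive layer operates on progressively fewer word-vectors, which is the source of the inference-time savings reported in the abstract.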