Hai Zhao | Min Zhang | Hongqiu Wu
[1] Hanan Samet, et al. Pruning Filters for Efficient ConvNets, 2016, ICLR.
[2] Shahrokh Valaee, et al. EDropout: Energy-Based Dropout and Pruning of Deep Neural Networks, 2021, IEEE Transactions on Neural Networks and Learning Systems.
[3] Wonyong Sung, et al. Structured Pruning of Deep Convolutional Neural Networks, 2015, ACM J. Emerg. Technol. Comput. Syst.
[4] Yann LeCun, et al. Regularization of Neural Networks using DropConnect, 2013, ICML.
[5] Hai Zhao, et al. Code Summarization with Structure-induced Transformer, 2020, Findings.
[6] Anamitra R. Choudhury, et al. PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination, 2020, ICML.
[7] Timo Aila, et al. Pruning Convolutional Neural Networks for Resource Efficient Inference, 2016, ICLR.
[8] James T. Kwok, et al. SparseBERT: Rethinking the Importance Analysis in Self-attention, 2021, ICML.
[9] Mohit Iyyer, et al. Hard-Coded Gaussian Attention for Neural Machine Translation, 2020, ACL.
[10] Sepp Hochreiter, et al. Self-Normalizing Neural Networks, 2017, NIPS.
[11] Furu Wei, et al. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing, 2020, EMNLP.
[12] David Chiang, et al. Auto-Sizing Neural Networks: With Applications to n-gram Language Models, 2015, EMNLP.
[13] Christopher D. Manning, et al. Compression of Neural Machine Translation Models via Pruning, 2016, CoNLL.
[14] Christopher Potts, et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.
[15] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.
[16] Il-Chul Moon, et al. Adversarial Dropout for Supervised and Semi-supervised Learning, 2017, AAAI.
[17] Yi Tay, et al. Synthesizer: Rethinking Self-Attention for Transformer Models, 2020, ICML.
[18] Edouard Grave, et al. Reducing Transformer Depth on Demand with Structured Dropout, 2019, ICLR.
[19] Yann LeCun, et al. Optimal Brain Damage, 1989, NIPS.
[20] Ashish Khetan, et al. schuBERT: Optimizing Elements of BERT, 2020, ACL.
[21] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[22] Erik F. Tjong Kim Sang, et al. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.
[23] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, 1992, NIPS.
[24] Andreas Loukas, et al. Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth, 2021, ICML.
[25] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[26] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[27] Yiming Yang, et al. DARTS: Differentiable Architecture Search, 2018, ICLR.
[28] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR.
[29] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.
[30] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[31] Ariel D. Procaccia, et al. Variational Dropout and the Local Reparameterization Trick, 2015, NIPS.
[32] Omer Levy, et al. Improving Transformer Models by Reordering their Sublayers, 2020, ACL.
[33] Quoc V. Le, et al. Neural Architecture Search with Reinforcement Learning, 2016, ICLR.
[34] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.