SparseBERT: Rethinking the Importance Analysis in Self-attention

Transformer-based models are widely used in natural language processing (NLP). Their core component, self-attention, has attracted widespread interest. A direct way to understand the self-attention mechanism is to visualize the attention maps of a pre-trained model, and a series of efficient Transformers with different sparse attention masks have been proposed based on the patterns observed. From a theoretical perspective, the universal approximability of Transformer-based models has also recently been proved. However, both lines of analysis rely on an already pre-trained model. To rethink the importance analysis in self-attention, we study the significance of different positions in the attention matrix during pre-training. A surprising result is that the diagonal elements of the attention map are the least important of all attention positions, and we provide a proof that these diagonal elements can indeed be removed without degrading model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which further guides the design of SparseBERT. Extensive experiments verify our findings and demonstrate the effectiveness of the proposed algorithm.
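
To make the two core ideas in the abstract concrete, the PyTorch sketch below gates each attention position with a learnable relaxed binary mask and removes the diagonal (token-to-itself) positions before renormalizing. It is a minimal illustration under stated assumptions: the class name MaskedSelfAttention, the Gumbel-sigmoid (binary Concrete) relaxation, and the mean-sigmoid sparsity penalty are hypothetical choices, not the paper's exact DAM formulation or the SparseBERT implementation.

```python
# Minimal sketch of self-attention with a learnable sparse attention mask.
# Illustrative assumptions only: the Gumbel-sigmoid relaxation and the
# sparsity penalty are one plausible way to make the mask differentiable,
# not the paper's exact DAM algorithm.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedSelfAttention(nn.Module):
    """Single-head self-attention whose attention positions are gated by a
    learnable relaxed binary mask, with diagonal positions dropped."""

    def __init__(self, d_model: int, seq_len: int, drop_diagonal: bool = True):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One logit per attention position; sigmoid(logit) is the probability
        # that the position is kept in the sparse attention mask.
        self.mask_logits = nn.Parameter(torch.zeros(seq_len, seq_len))
        self.drop_diagonal = drop_diagonal
        self.d_model = d_model

    def sample_mask(self, tau: float = 1.0) -> torch.Tensor:
        # Binary-Concrete / Gumbel-sigmoid relaxation keeps the mask
        # differentiable so it can be learned jointly during pre-training.
        u = torch.rand_like(self.mask_logits).clamp(1e-6, 1.0 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        soft_mask = torch.sigmoid((self.mask_logits + logistic_noise) / tau)
        if self.drop_diagonal:
            # Remove token-to-itself attention, the positions the paper
            # identifies as least important.
            soft_mask = soft_mask * (1.0 - torch.eye(soft_mask.size(0)))
        return soft_mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        attn = F.softmax(scores, dim=-1) * self.sample_mask()  # gate positions
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize
        return attn @ v


x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
layer = MaskedSelfAttention(d_model=64, seq_len=16)
out = layer(x)
# A simple sparsity regularizer: push the expected number of kept positions down.
sparsity_penalty = torch.sigmoid(layer.mask_logits).mean()
print(out.shape, float(sparsity_penalty))
```

In a full pre-training setup, the soft mask would typically be annealed toward a hard 0/1 mask and could be shared across layers or heads; those design choices are omitted here for brevity.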
