SparseBERT: Rethinking the Importance Analysis in Self-attention

Transformer-based models are widely used in natural language processing (NLP). Their core component, self-attention, has attracted widespread interest. A direct way to understand the self-attention mechanism is to visualize the attention maps of a pre-trained model, and a series of efficient Transformers with different sparse attention masks have been proposed based on the patterns observed. From a theoretical perspective, the universal approximability of Transformer-based models has also recently been proved. However, both lines of analysis rely on an already pre-trained model. To rethink the importance analysis in self-attention, we study the significance of different positions in the attention matrix during pre-training. A surprising result is that the diagonal elements of the attention map are the least important of all attention positions, and we provide a proof that these diagonal elements can indeed be removed without degrading model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which further guides the design of SparseBERT. Extensive experiments verify our findings and demonstrate the effectiveness of the proposed algorithm.
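
To make the two core ideas in the abstract concrete, the PyTorch sketch below gates each attention position with a learnable relaxed binary mask and removes the diagonal (token-to-itself) positions before renormalizing. It is a minimal illustration under stated assumptions: the class name MaskedSelfAttention, the Gumbel-sigmoid (binary Concrete) relaxation, and the mean-sigmoid sparsity penalty are hypothetical choices, not the paper's exact DAM formulation or the SparseBERT implementation.

```python
# Minimal sketch of self-attention with a learnable sparse attention mask.
# Illustrative assumptions only: the Gumbel-sigmoid relaxation and the
# sparsity penalty are one plausible way to make the mask differentiable,
# not the paper's exact DAM algorithm.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedSelfAttention(nn.Module):
    """Single-head self-attention whose attention positions are gated by a
    learnable relaxed binary mask, with diagonal positions dropped."""

    def __init__(self, d_model: int, seq_len: int, drop_diagonal: bool = True):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One logit per attention position; sigmoid(logit) is the probability
        # that the position is kept in the sparse attention mask.
        self.mask_logits = nn.Parameter(torch.zeros(seq_len, seq_len))
        self.drop_diagonal = drop_diagonal
        self.d_model = d_model

    def sample_mask(self, tau: float = 1.0) -> torch.Tensor:
        # Binary-Concrete / Gumbel-sigmoid relaxation keeps the mask
        # differentiable so it can be learned jointly during pre-training.
        u = torch.rand_like(self.mask_logits).clamp(1e-6, 1.0 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        soft_mask = torch.sigmoid((self.mask_logits + logistic_noise) / tau)
        if self.drop_diagonal:
            # Remove token-to-itself attention, the positions the paper
            # identifies as least important.
            soft_mask = soft_mask * (1.0 - torch.eye(soft_mask.size(0)))
        return soft_mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        attn = F.softmax(scores, dim=-1) * self.sample_mask()  # gate positions
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize
        return attn @ v


x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
layer = MaskedSelfAttention(d_model=64, seq_len=16)
out = layer(x)
# A simple sparsity regularizer: push the expected number of kept positions down.
sparsity_penalty = torch.sigmoid(layer.mask_logits).mean()
print(out.shape, float(sparsity_penalty))
```

In a full pre-training setup, the soft mask would typically be annealed toward a hard 0/1 mask and could be shared across layers or heads; those design choices are omitted here for brevity.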
