Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions to individual input features with different saliency measures, but it fails to explain how these input features interact with each other to reach predictions. In this paper, we propose a self-attention attribution algorithm to interpret the information interactions inside Transformer. We take BERT as an example to conduct extensive studies. First, we extract the most salient dependencies in each layer to construct an attribution graph, which reveals the hierarchical interactions inside Transformer. Furthermore, we apply self-attention attribution to identify the important attention heads, while the others can be pruned with only marginal performance degradation. Finally, we show that the attribution results can be used as adversarial patterns to implement non-targeted attacks against BERT.
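At its core, self-attention attribution applies an integrated-gradients-style integral over the attention matrix: the attention probabilities are scaled from zero up to their actual values, the gradients of the model output with respect to the scaled matrix are accumulated, and the result is multiplied elementwise by the attention matrix, so large entries mark salient token-to-token interactions. The sketch below illustrates this for a toy single-head attention layer; the layer itself, the pooled scalar logit, and helper names such as `logit_from_attention` are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of self-attention attribution via integrated gradients over
# the attention matrix, in the spirit of the paper's algorithm. The toy
# single-head attention layer, the pooled scalar logit, and all helper names
# are illustrative assumptions, not the authors' released code.
import torch

torch.manual_seed(0)

d_model, seq_len = 16, 6
x = torch.randn(seq_len, d_model)                 # token representations
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
w_out = torch.randn(d_model)                      # toy classification head

def logit_from_attention(A):
    """Scalar model output given a (seq_len x seq_len) attention matrix A."""
    context = A @ (x @ Wv)                        # attention-weighted value vectors
    return context.mean(dim=0) @ w_out            # mean-pool, then a linear "logit"

# Forward pass to obtain the attention probabilities A.
scores = (x @ Wq) @ (x @ Wk).T / d_model ** 0.5
A = torch.softmax(scores, dim=-1).detach()

# Integrated gradients along the scaled path alpha * A, alpha in (0, 1],
# approximated by an m-step Riemann sum.
m = 20
grad_sum = torch.zeros_like(A)
for k in range(1, m + 1):
    A_scaled = ((k / m) * A).requires_grad_(True)
    logit = logit_from_attention(A_scaled)
    grad_sum += torch.autograd.grad(logit, A_scaled)[0]

# Attribution: elementwise product of A with the averaged gradients.
# Large entries indicate the most salient token-to-token interactions.
attribution = A * grad_sum / m
print(attribution)
```

In a full model such as BERT, the same integral would be computed per attention head in every layer, with the task logit as the output function; the most salient entries of the resulting attribution matrices can then be extracted layer by layer to build the attribution graph described in the abstract.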
