Is Attention Interpretable?

Attention mechanisms have recently boosted performance on a range of NLP tasks. Because attention layers explicitly weight input components’ representations, it is also often assumed that attention can be used to identify information that models found important (e.g., specific contextualized word tokens). We test whether that assumption holds by manipulating attention weights in already-trained text classification models and analyzing the resulting differences in their predictions. While we observe some ways in which higher attention weights correlate with greater impact on model predictions, we also find many ways in which this does not hold, i.e., where gradient-based rankings of attention weights better predict their effects than their magnitudes. We conclude that while attention noisily predicts input components’ overall importance to a model, it is by no means a fail-safe indicator.
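The abstract describes ablating individual attention weights in an already-trained classifier and measuring how much the prediction changes. Below is a minimal, self-contained sketch of that kind of test on a toy attention-weighted classification head; the array shapes, the `classify` helper, and the Jensen-Shannon comparison are illustrative assumptions, not the paper's actual models or evaluation code.

```python
import numpy as np

def renormalize_after_zeroing(attn, idx):
    """Zero out the attention weight at position idx and renormalize the rest to sum to 1."""
    attn = attn.copy()
    attn[idx] = 0.0
    total = attn.sum()
    return attn / total if total > 0 else attn

def classify(attn, values, w_out):
    """Toy classification head: attention-weighted sum of token representations,
    followed by a linear layer and a softmax over classes."""
    context = attn @ values                      # (hidden,)
    logits = w_out @ context                     # (num_classes,)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
seq_len, hidden, num_classes = 6, 8, 2
values = rng.normal(size=(seq_len, hidden))      # stand-in contextualized token representations
w_out = rng.normal(size=(num_classes, hidden))   # stand-in classifier weights
attn = rng.dirichlet(np.ones(seq_len))           # attention distribution over tokens

base = classify(attn, values, w_out)
top = int(np.argmax(attn))                       # highest-attended token
ablated = classify(renormalize_after_zeroing(attn, top), values, w_out)

# Two simple effect measures: does the predicted class flip, and how far apart
# are the two output distributions (Jensen-Shannon divergence)?
mid = 0.5 * (base + ablated)
js = 0.5 * np.sum(base * np.log(base / mid)) + 0.5 * np.sum(ablated * np.log(ablated / mid))
print("decision flips:", base.argmax() != ablated.argmax())
print("Jensen-Shannon divergence:", js)
```

Repeating this while removing tokens in order of attention magnitude versus in order of a gradient-based ranking, and comparing how quickly the decision flips under each ordering, mirrors the comparison the abstract summarizes.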
