On the Faithfulness Measurements for Model Interpretations

Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretation methods, how to define and quantitatively measure the faithfulness of interpretations, i.e., the extent to which they conform to the reasoning process behind the model, remains an open problem. To tackle this issue, we start from three criteria that quantify different notions of faithfulness: the removal-based criterion, the sensitivity of interpretations, and the stability of interpretations, and propose novel paradigms to systematically evaluate interpretations in NLP. Our results show that the performance of interpretations under different faithfulness criteria can vary substantially. Motivated by the desiderata behind these faithfulness notions, we introduce a new class of interpretation methods that adopt techniques from the adversarial robustness domain. Empirical results show that our proposed methods achieve top performance under all three criteria. Through experiments and analysis on both text classification and dependency parsing tasks, we arrive at a more comprehensive understanding of the diverse set of interpretations.
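The removal-based criterion can be made concrete with a simple comprehensiveness-style check: mask the tokens an interpretation ranks as most important and measure how much the model's predicted probability drops. The sketch below is illustrative only, not the paper's actual evaluation protocol; it assumes a generic `predict_proba` interface over token lists, and the function name, the `[MASK]` placeholder, and the toy scorer are all hypothetical.

```python
import numpy as np

def removal_based_faithfulness(predict_proba, tokens, attributions, k, mask_token="[MASK]"):
    """Hypothetical removal-based check: mask the k tokens with the highest
    attribution scores and report the drop in the originally predicted class
    probability. A larger drop suggests the interpretation highlights tokens
    the model actually relies on."""
    probs = predict_proba(tokens)                # class probabilities on the full input
    label = int(np.argmax(probs))                # originally predicted class
    top_k = np.argsort(attributions)[::-1][:k]   # indices of the k highest-attributed tokens
    masked = [mask_token if i in set(top_k) else tok for i, tok in enumerate(tokens)]
    masked_probs = predict_proba(masked)         # probabilities after removal
    return probs[label] - masked_probs[label]    # probability drop (comprehensiveness-style)

# Toy usage with a stand-in scorer (placeholder for a real NLP classifier).
tokens = ["the", "movie", "was", "wonderful"]
attributions = np.array([0.05, 0.10, 0.05, 0.80])
dummy = lambda toks: np.array([0.1, 0.9]) if "wonderful" in toks else np.array([0.6, 0.4])
print(removal_based_faithfulness(dummy, tokens, attributions, k=1))  # -> 0.5
```

Sensitivity- and stability-based criteria would instead compare attributions before and after small input perturbations rather than measuring a prediction drop.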
