"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification

Feature attribution methods, a.k.a. input salience methods, which assign an importance score to each input feature, are abundant but may produce surprisingly different results for the same model on the same input. While differences are expected if disparate definitions of importance are assumed, most methods claim to provide faithful attributions and to point at the features most relevant to a model’s prediction. Existing work on faithfulness evaluation is not conclusive and does not provide a clear answer as to how different methods should be compared. Focusing on text classification and the model-debugging scenario, our main contribution is a protocol for faithfulness evaluation that makes use of partially synthetic data to obtain ground truth for the feature importance ranking. Following the protocol, we conduct an in-depth analysis of four standard classes of salience methods on a range of datasets and shortcuts for BERT and LSTM models, and demonstrate that some of the most popular method configurations provide poor results even for the simplest shortcuts. We recommend following the protocol for each new task and model combination to find the best method for identifying shortcuts.
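To make the evaluation idea concrete, the sketch below illustrates one way partially synthetic data with ground-truth feature importance could be constructed and a salience method scored against it: a known shortcut token is planted into examples of one class, and the method is credited when its top-ranked tokens recover the planted shortcut. This is a minimal sketch under our own assumptions, not the paper's exact protocol; the names `plant_shortcut`, `precision_at_k`, `evaluate`, the token `zeroa`, and the precision@k metric are illustrative.

```python
# Minimal sketch of a shortcut-based faithfulness check (illustrative, not the
# authors' exact implementation). Assumes whitespace tokenization and a
# single-token shortcut; `salience_fn` is any method returning one score per token.
import random
from typing import Callable, List, Sequence, Tuple

SHORTCUT_TOKEN = "zeroa"  # hypothetical artificial token planted into one class


def plant_shortcut(texts: Sequence[str], labels: Sequence[int],
                   target_label: int = 1, rate: float = 1.0,
                   seed: int = 0) -> Tuple[List[str], List[List[int]]]:
    """Insert the shortcut token into examples of `target_label`;
    return modified texts and ground-truth positions of the shortcut."""
    rng = random.Random(seed)
    new_texts, gold_positions = [], []
    for text, label in zip(texts, labels):
        tokens = text.split()
        positions: List[int] = []
        if label == target_label and rng.random() < rate:
            pos = rng.randrange(len(tokens) + 1)
            tokens.insert(pos, SHORTCUT_TOKEN)
            positions = [pos]
        new_texts.append(" ".join(tokens))
        gold_positions.append(positions)
    return new_texts, gold_positions


def precision_at_k(scores: Sequence[float], gold: Sequence[int], k: int) -> float:
    """Fraction of the top-k salience positions that hit a planted shortcut token."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return len(set(top_k) & set(gold)) / k


def evaluate(salience_fn: Callable[[str], Sequence[float]],
             texts: Sequence[str], gold_positions: Sequence[Sequence[int]],
             k: int = 1) -> float:
    """Average precision@k of a salience method at recovering the shortcut."""
    hits = [precision_at_k(salience_fn(t), g, k)
            for t, g in zip(texts, gold_positions) if g]
    return sum(hits) / max(len(hits), 1)
```

In practice, a model would first be trained on the shortcut-contaminated data and verified (e.g., via a clean held-out set) to actually rely on the shortcut before salience methods are compared with `evaluate`.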
