Contrastive Explanations for Model Interpretability

Contrastive explanations clarify why an event occurred in contrast to another. They are inherently more intuitive for humans both to produce and to comprehend. We propose a methodology to produce contrastive explanations for classification models by modifying the representation to disregard non-contrastive information, and by modifying model behavior so that it is based only on contrastive reasoning. Our method projects the model's representation onto a latent space that captures only the features that are useful (to the model) for differentiating between two potential decisions. We demonstrate the value of contrastive explanations by analyzing two different scenarios, using both high-level abstract concept attribution and low-level input token/span attribution, on two widely used text classification tasks. Specifically, we produce explanations that answer: for which label, and against which alternative label, is some aspect of the input useful? And which aspects of the input are useful for and against particular decisions? Overall, our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretation of a model's decision.
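
To make the projection idea concrete, the sketch below keeps only the component of a hidden representation that lies along the direction separating two labels of a linear classification head. This is a minimal illustration under stated assumptions: the function name `contrastive_projection`, the single weight matrix `W`, and the one-dimensional projection are illustrative choices for this sketch, not necessarily the paper's exact construction, which may project onto a richer latent subspace.

```python
import numpy as np

def contrastive_projection(h, W, label_a, label_b):
    """Keep only the part of a representation that distinguishes two labels.

    h:  (d,)            hidden representation produced by the encoder
    W:  (num_labels, d) weight matrix of a linear classification head
    Returns the component of h along the direction W[label_a] - W[label_b];
    everything orthogonal to that direction is treated as non-contrastive
    information and discarded.
    """
    u = W[label_a] - W[label_b]       # direction separating the two labels
    u = u / np.linalg.norm(u)         # normalize to a unit vector
    return np.dot(h, u) * u           # projection of h onto that direction

# Toy usage: a 768-dimensional representation and a 3-way classifier.
rng = np.random.default_rng(0)
h = rng.normal(size=768)
W = rng.normal(size=(3, 768))
h_contrastive = contrastive_projection(h, W, label_a=0, label_b=2)
```

The intuition behind this sketch: for a linear head, the score difference between the two labels depends only on the component of `h` along `W[label_a] - W[label_b]`, so discarding the orthogonal part removes exactly the information that cannot affect the contrastive (label-a-versus-label-b) decision.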
