SELFEXPLAIN: A Self-Explaining Architecture for Neural Text Classifiers

We introduce SELFEXPLAIN, a novel self-explaining model that explains a text classifier's predictions using phrase-based concepts. SELFEXPLAIN augments existing neural classifiers by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SELFEXPLAIN facilitates interpretability without sacrificing performance. Most importantly, explanations from SELFEXPLAIN show sufficiency for model predictions and are perceived by human judges as more adequate, trustworthy, and understandable than existing widely used baselines.
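
To make the two interpretability layers described above more concrete, the sketch below gives a minimal PyTorch-style rendering of them. All module names, the cosine-similarity retrieval over a precomputed concept bank, and the difference-of-logits relevance score are illustrative assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of the two interpretability layers described in the abstract.
# Names and the exact scoring functions are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalInterpretabilityLayer(nn.Module):
    """Scores each input phrase (concept) by its relevance to the predicted label."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_labels)

    def forward(self, phrase_reprs: torch.Tensor, sentence_logits: torch.Tensor):
        # phrase_reprs: (num_phrases, hidden_dim) encodings of the input phrases
        # sentence_logits: (num_labels,) logits of the full-sentence prediction
        phrase_logits = self.scorer(phrase_reprs)  # (num_phrases, num_labels)
        # Relevance of a phrase: how far its label distribution sits from the
        # sentence-level prediction (a simple proxy, assumed here for illustration).
        relevance = sentence_logits.unsqueeze(0) - phrase_logits
        return relevance.softmax(dim=-1)


class GlobalInterpretabilityLayer(nn.Module):
    """Retrieves the training-set concepts most similar to the current input."""

    def __init__(self, concept_bank: torch.Tensor, top_k: int = 5):
        super().__init__()
        # concept_bank: (num_train_concepts, hidden_dim) precomputed phrase encodings
        self.register_buffer("concept_bank", concept_bank)
        self.top_k = top_k

    def forward(self, sentence_repr: torch.Tensor):
        # Cosine similarity between the input representation and every training concept.
        sims = F.cosine_similarity(self.concept_bank, sentence_repr.unsqueeze(0), dim=-1)
        return sims.topk(self.top_k)  # (values, indices) of the most influential concepts
```

Both layers attach on top of an existing encoder's phrase and sentence representations, which is consistent with the abstract's claim that SELFEXPLAIN augments a classifier rather than replacing it.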
