More Than Words: Towards Better Quality Interpretations of Text Classifiers

The large size and complex decision mechanisms of state-of-the-art text classifiers make it difficult for humans to understand their predictions, potentially undermining user trust. These concerns have driven the adoption of methods such as SHAP and Integrated Gradients, which explain classification decisions by assigning importance scores to input tokens. However, prior work using different randomization tests has shown that the interpretations these methods generate may not be robust: for instance, models that make identical predictions on the test set can still yield different feature-importance rankings. To address this lack of robustness in token-based interpretability, we explore explanations at higher semantic levels, such as sentences. We use computational metrics and human-subject studies to compare the quality of sentence-based interpretations against token-based ones. Our experiments show that higher-level feature attributions offer several advantages: 1) they are more robust as measured by randomization tests, 2) they exhibit lower variability when using approximation-based methods like SHAP, and 3) they are more intelligible to humans when the linguistic coherence resides at a higher level of granularity. Based on these findings, we show that token-based interpretability, while a convenient first choice given the input interfaces of ML models, is not the most effective choice in all situations.
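
The abstract does not detail how sentence-level attributions are computed; the sketch below is a rough, hypothetical illustration rather than the paper's actual method. It treats each sentence as a single on/off feature and runs KernelSHAP from the public shap library over those indicators. The names classify_fn, sentences, and target_class are assumptions: classify_fn stands for any function mapping a list of texts to an array of class probabilities, and sentences is the input document pre-split into sentences.

```python
import numpy as np
import shap  # pip install shap


def sentence_shap(classify_fn, sentences, target_class, nsamples=200):
    """Hypothetical sketch: explain a text classifier at the sentence level
    by treating each sentence as one binary 'keep or drop' feature and
    estimating Shapley values over those indicators with KernelSHAP."""
    n = len(sentences)

    def masked_predict(masks):
        # masks: (num_coalitions, n) array of 0/1 sentence indicators.
        # Each row is turned back into a text containing only the kept
        # sentences; classify_fn must tolerate an empty string.
        texts = [" ".join(s for s, keep in zip(sentences, row) if keep)
                 for row in masks]
        probs = classify_fn(texts)          # assumed shape: (batch, classes)
        return probs[:, target_class]

    # Background = all sentences removed; explained instance = all kept.
    explainer = shap.KernelExplainer(masked_predict, np.zeros((1, n)))
    # Returns one Shapley value per sentence for this single-output setting.
    return explainer.shap_values(np.ones((1, n)), nsamples=nsamples)[0]
```

Because each sampled coalition drops whole sentences rather than individual tokens, the resulting attributions assign credit to units with more linguistic coherence, which is the granularity the abstract argues for; the same idea carries over to gradient-based methods by aggregating token scores within each sentence.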
