Post-hoc Interpretability for Neural NLP: A Survey

Neural networks for NLP are becoming increasingly complex and widespread, and there is growing concern about whether these models can be used responsibly. Explaining models helps address safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model has been trained and are generally model-agnostic. This survey categorizes how recent post-hoc interpretability methods communicate explanations to humans, discusses each method in depth, and describes how these methods are validated, as the latter is often a common concern.
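
As an illustration of what "post-hoc and model-agnostic" means in practice, the minimal sketch below scores input tokens by occluding them one at a time and re-querying a trained classifier as a black box. The names `predict_proba` and `occlusion_salience` are hypothetical placeholders introduced here for illustration; this is not a specific method from the surveyed works, only an example of the general pattern of explaining a model purely through its predictions.

```python
# Minimal sketch of a post-hoc, model-agnostic explanation: leave-one-token-out
# occlusion salience. The classifier is treated as a black box; `predict_proba`
# is a hypothetical callable mapping a text string to a list of class probabilities.

from typing import Callable, List, Tuple


def occlusion_salience(
    text: str,
    predict_proba: Callable[[str], List[float]],
    target_class: int,
) -> List[Tuple[str, float]]:
    """Score each token by the probability drop caused by removing it."""
    tokens = text.split()
    base_score = predict_proba(text)[target_class]
    salience = []
    for i in range(len(tokens)):
        # Rebuild the input with token i removed and re-query the black box.
        occluded = " ".join(tokens[:i] + tokens[i + 1:])
        drop = base_score - predict_proba(occluded)[target_class]
        salience.append((tokens[i], drop))
    return salience
```

Because the procedure only queries the model's predictions, it applies unchanged to any trained classifier, which is what makes it post-hoc and model-agnostic.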
