ferret: a Framework for Benchmarking Explainers on Transformers

As Transformers are increasingly relied upon to solve complex NLP problems, there is a growing need for their decisions to be interpretable by humans. While several explainable AI (XAI) techniques have been proposed for interpreting the outputs of transformer-based models, using and comparing them remains cumbersome. We introduce ferret, a Python library that simplifies the use and comparison of XAI methods on transformer-based classifiers. With ferret, users can visualize and compare explanations of transformer-based model outputs produced by state-of-the-art XAI methods, on any free text or on existing XAI corpora. Moreover, users can compute ad-hoc XAI metrics to select the most faithful and plausible explanations. To align with the now-standard practice of sharing and using transformer-based models through Hugging Face, ferret interfaces directly with its transformers Python library. In this paper, we showcase ferret by benchmarking XAI methods on transformers for sentiment analysis and hate speech detection. We show that specific methods consistently provide better explanations and are preferable in the context of transformer models.
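The following is a minimal usage sketch of the workflow the abstract describes: wrap a Hugging Face classifier, generate explanations with several XAI methods, and score them with faithfulness and plausibility metrics. The class and method names (Benchmark, explain, evaluate_explanations, show_evaluation_table) and the model checkpoint are assumptions drawn from typical ferret usage, not an authoritative API reference.

```python
# Minimal sketch; names below are assumed from typical ferret usage and the
# checkpoint is only an example, not prescribed by the paper.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ferret import Benchmark

name = "cardiffnlp/twitter-xlm-roberta-base-sentiment"  # example Hub checkpoint
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# Wrap model and tokenizer; the Benchmark object runs the supported explainers
# (gradient- and perturbation-based) on a given input text.
bench = Benchmark(model, tokenizer)
explanations = bench.explain("You look stunning!", target=1)

# Score each explanation with the built-in faithfulness and plausibility
# metrics, then render a comparison table across explainers.
evaluations = bench.evaluate_explanations(explanations, target=1)
bench.show_evaluation_table(evaluations)
```

The same wrapper can also be pointed at existing XAI corpora with human rationales, so that plausibility is computed against gold annotations rather than on free text alone.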
