Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution remain biased, while the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation on multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations that do not reflect the underlying biases.
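
The intuition behind the scaffolding can be illustrated with a minimal, hypothetical sketch: perturbation-based explainers such as LIME and SHAP query the model on synthetic points that tend to lie off the data manifold, so an adversary can train an out-of-distribution detector and route those queries to an innocuous model while real inputs still receive biased predictions. The sketch below assumes scikit-learn and NumPy; the Gaussian perturbation scheme, the random-forest detector, and the two toy models are illustrative assumptions, not the authors' exact construction.

```python
# Minimal sketch of the scaffolding idea (illustrative, not the paper's code).
# A biased model is wrapped so that off-manifold perturbations -- the points
# LIME/SHAP actually query -- are routed to an innocuous model, while real
# (in-distribution) inputs still receive the biased predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def biased_model(X):
    # Hypothetical biased rule: predict purely from column 0
    # (standing in for a sensitive attribute such as race).
    return (X[:, 0] > 0.5).astype(int)

def innocuous_model(X):
    # Hypothetical unbiased rule: predict from an uncorrelated feature.
    return (X[:, 1] > 0.5).astype(int)

class ScaffoldedClassifier:
    """Wraps a biased model with an out-of-distribution (OOD) detector."""

    def __init__(self, n_perturbations=1000, noise_scale=0.3):
        self.detector = RandomForestClassifier(n_estimators=100)
        self.n_perturbations = n_perturbations
        self.noise_scale = noise_scale

    def fit(self, X_real):
        # Label real rows 1 (in-distribution) and noisy copies 0 (perturbed),
        # then train the detector to tell them apart.
        rng = np.random.default_rng(0)
        idx = rng.integers(0, len(X_real), self.n_perturbations)
        X_fake = X_real[idx] + rng.normal(
            0, self.noise_scale, (self.n_perturbations, X_real.shape[1]))
        X_all = np.vstack([X_real, X_fake])
        y_all = np.hstack([np.ones(len(X_real)), np.zeros(len(X_fake))])
        self.detector.fit(X_all, y_all)
        return self

    def predict(self, X):
        # In-distribution points get the biased prediction; suspected
        # perturbations (the ones explainers see) get the innocuous one.
        in_dist = self.detector.predict(X).astype(bool)
        return np.where(in_dist, biased_model(X), innocuous_model(X))
```

Under these assumptions, running LIME or SHAP on ScaffoldedClassifier.predict would be expected to attribute the predictions largely to the innocuous feature, since the explainers' perturbed samples are routed to innocuous_model, even though predictions on the real data distribution remain those of biased_model.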
