Towards the Unification and Robustness of Perturbation and Gradient Based Explanations

As machine learning black boxes are increasingly deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a post hoc manner. In this work, we analyze two popular post hoc interpretation techniques: SmoothGrad, a gradient-based method, and a variant of LIME, a perturbation-based method. More specifically, we derive explicit closed-form expressions for the explanations output by these two methods and show that they both converge to the same explanation in expectation, i.e., when the number of perturbed samples used by these methods is large. We then leverage this connection to establish other desirable properties, such as robustness, for these techniques. We also derive finite-sample complexity bounds on the number of perturbations required for these methods to converge to their expected explanations. Finally, we empirically validate our theory through extensive experiments on both synthetic and real-world datasets.
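To make the claimed equivalence concrete, below is a minimal sketch (not the authors' code) of the connection described above: with Gaussian perturbations around an input, the coefficients of a local least-squares fit (a LIME-style explanation) approach the average gradient over those same perturbations (SmoothGrad). The toy black box f, the perturbation scale sigma, and the sample count n are illustrative assumptions.

# Minimal sketch: SmoothGrad vs. a LIME-style local linear fit under
# Gaussian perturbations. The toy model and all parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    # Toy nonlinear "black box": f(x) = sin(x0) + x1^2 + 0.5 * x0 * x1
    return np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 0] * X[:, 1]

def grad_f(X):
    # Analytic gradient of the toy model, used here only to form SmoothGrad.
    g0 = np.cos(X[:, 0]) + 0.5 * X[:, 1]
    g1 = 2.0 * X[:, 1] + 0.5 * X[:, 0]
    return np.stack([g0, g1], axis=1)

x = np.array([0.3, -0.7])   # point being explained
sigma = 0.1                 # perturbation scale
n = 200_000                 # number of perturbed samples

Z = rng.normal(0.0, sigma, size=(n, 2))   # Gaussian perturbations
Xp = x + Z                                # perturbed inputs

# SmoothGrad: average gradient over the perturbed inputs.
smoothgrad = grad_f(Xp).mean(axis=0)

# LIME-style explanation: least-squares fit of f(x + z) on the perturbations z.
# With N(0, sigma^2 I) perturbations the coefficients approximate
# Cov(z, f)/sigma^2, which by Stein's lemma equals E[grad f(x + z)].
y = f(Xp)
lime_coef, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)

print("SmoothGrad:", smoothgrad)
print("LIME fit:  ", lime_coef)   # the two vectors should nearly coincide

With a large number of perturbed samples, the two printed attribution vectors nearly coincide, mirroring the expectation-level equivalence stated above; Stein's lemma is what ties the regression coefficients to the expected gradient under the Gaussian perturbation distribution.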
