Robust and Stable Black Box Explanations

As machine learning black boxes are increasingly deployed in real-world applications, there has been growing interest in developing post hoc explanations that summarize the behavior of these black boxes. However, existing algorithms for generating such explanations have been shown to lack stability and robustness to distribution shifts. We propose a novel framework, based on adversarial training, for generating robust and stable explanations of black box models. Our framework optimizes a minimax objective that aims to construct the highest-fidelity explanation with respect to the worst case over a set of adversarial perturbations. We instantiate this algorithm for explanations in the form of linear models and decision sets by devising the required optimization procedures. To the best of our knowledge, this work makes the first attempt at generating post hoc explanations that are robust to a general class of adversarial perturbations of practical interest. Experimental evaluation on real-world and synthetic datasets demonstrates that our approach substantially improves the robustness of explanations without sacrificing their fidelity on the original data distribution.
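To make the minimax objective concrete, here is one plausible formulation in our own notation (the paper's exact symbols, perturbation class, and fidelity measure may differ): given a black box $B$, a family of candidate explanations $\mathcal{E}$ (e.g., linear models or decision sets), a fidelity loss $\ell$, and a set $\Delta$ of admissible perturbations modeling distribution shift,

$$
\min_{E \in \mathcal{E}} \; \max_{\delta \in \Delta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \ell\big(E(x + \delta),\, B(x + \delta)\big) \Big],
$$

i.e., the selected explanation $E$ is the one whose fidelity to $B$ degrades the least under the worst-case perturbation $\delta$.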
