Robust Models Are More Interpretable Because Attributions Look Normal

Recent work has found that adversarially robust deep networks used for image classification are more interpretable: their feature attributions tend to be sharper and more concentrated on the objects associated with the image's ground-truth class. We show that smooth decision boundaries play an important role in this enhanced interpretability, as the model's input gradients around data points align more closely with the normal vectors of nearby decision boundaries when those boundaries are smooth. Thus, because robust models have smoother boundaries, the results of gradient-based attribution methods, like Integrated Gradients and DeepLIFT, capture more accurate information about nearby decision boundaries. This understanding of robust interpretability leads to our second contribution: boundary attributions, which aggregate information about the normal vectors of local decision boundaries to explain a classification outcome. We show that by leveraging the key factors underpinning robust interpretability, boundary attributions produce sharper, more concentrated visual explanations, even on non-robust models. An example implementation can be found at https://github.com/zifanw/boundary.
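
For intuition, the following is a minimal PyTorch sketch of the normal-vector idea that boundary attributions build on: it walks a single input toward the nearest decision boundary with a small signed-gradient search and returns the gradient of the class-score margin at the crossing point as an estimate of the boundary's normal vector. The function names, step sizes, and the single-boundary, batch-of-one simplifications are illustrative assumptions, not the authors' implementation, which lives in the repository above.

import torch

def margin(logits, label):
    # Score difference between the labeled class and the strongest competing
    # class; the decision boundary for `label` is the set where this is zero.
    # Assumes logits has shape [1, num_classes].
    mask = torch.ones_like(logits, dtype=torch.bool)
    mask[0, label] = False
    return logits[0, label] - logits[mask].max()

def boundary_normal(model, x, label, steps=50, step_size=1e-2):
    # Estimate the normal vector of the decision boundary nearest to x by
    # stepping against the sign of the margin gradient until the margin
    # becomes non-positive (i.e., the boundary is crossed).
    x_b = x.clone().detach().requires_grad_(True)
    grad = torch.zeros_like(x_b)
    for _ in range(steps):
        m = margin(model(x_b), label)
        grad, = torch.autograd.grad(m, x_b)
        if m.item() <= 0:
            # x_b now lies on or just past the boundary; the margin gradient
            # there approximates the boundary's normal vector.
            return grad.detach()
        # Otherwise take a small signed-gradient step toward the boundary.
        x_b = (x_b - step_size * grad.sign()).detach().requires_grad_(True)
    # Fall back to the last gradient if no crossing occurred within `steps`.
    return grad.detach()

On a robust model, this estimated normal tends to align with the input gradient taken at the original point itself, which is the geometric connection between smooth boundaries and sharper gradient-based attributions described above.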
