Understanding and Diagnosing Vulnerability under Adversarial Attacks

Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks. However, there is still little insight into how slight perturbations can cause such large changes in classification results, or into how to design more robust model architectures. In this work, we propose a novel interpretability method, InterpretGAN, which generates explanations for the features encoded in latent variables and used for classification. Interpreting how adversarial examples are classified exposes, layer by layer, how adversarial perturbations influence features and which features they modify. Moreover, we design the first diagnostic method to quantify the vulnerability contributed by each layer, which can be used to identify vulnerable parts of a model architecture. The diagnostic results show that layers introducing more information loss tend to be more vulnerable than other layers. Based on these findings, our evaluation results on the MNIST and CIFAR-10 datasets suggest that average pooling layers, which incur lower information loss, are more robust than max pooling layers for the network architectures studied in this paper.
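
The architectural takeaway (average pooling tending to be more robust than max pooling in the studied networks) and the idea of a per-layer vulnerability diagnostic can be illustrated with a small sketch. The following is a minimal PyTorch sketch, not the paper's InterpretGAN or its exact diagnostic: SmallCNN, fgsm, and layerwise_shift are hypothetical names, and the relative activation-shift metric is an assumed stand-in proxy for per-layer vulnerability.

    # Minimal sketch: a small MNIST-style CNN whose pooling type can be swapped
    # between max and average pooling, plus a simple per-layer proxy for
    # vulnerability -- the relative shift of each layer's activations when the
    # input is perturbed by FGSM. Layer names, network shape, and the proxy
    # metric are illustrative assumptions, not taken from the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallCNN(nn.Module):
        def __init__(self, pooling: str = "max"):
            super().__init__()
            pool = nn.MaxPool2d(2) if pooling == "max" else nn.AvgPool2d(2)
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), pool,   # 28x28 -> 14x14
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), pool,  # 14x14 -> 7x7
            )
            self.classifier = nn.Linear(32 * 7 * 7, 10)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    def fgsm(model, x, y, eps=0.1):
        """Single-step FGSM perturbation (Goodfellow et al.)."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        return (x + eps * x.grad.sign()).clamp(0, 1).detach()

    def layerwise_shift(model, x_clean, x_adv):
        """Relative L2 shift of each feature layer's output -- an illustrative
        per-layer vulnerability proxy, not the diagnostic defined in the paper."""
        shifts = {}
        with torch.no_grad():
            h_clean, h_adv = x_clean, x_adv
            for i, layer in enumerate(model.features):
                h_clean, h_adv = layer(h_clean), layer(h_adv)
                num = (h_adv - h_clean).flatten(1).norm(dim=1)
                den = h_clean.flatten(1).norm(dim=1) + 1e-8
                shifts[f"layer_{i}_{layer.__class__.__name__}"] = (num / den).mean().item()
        return shifts

    if __name__ == "__main__":
        x = torch.rand(8, 1, 28, 28)           # stand-in batch; use real MNIST data in practice
        y = torch.randint(0, 10, (8,))
        for pooling in ("max", "avg"):
            model = SmallCNN(pooling).eval()   # untrained here; train before drawing conclusions
            x_adv = fgsm(model, x, y)
            print(pooling, layerwise_shift(model, x, x_adv))

Under the paper's finding, one would expect the "avg" variant of a trained network to show smaller downstream activation shifts than the "max" variant, but this sketch only illustrates the measurement idea and is not evidence by itself.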
