Interpreting Adversarial Examples by Activation Promotion and Suppression

Convolutional neural networks (CNNs) are widely known to be vulnerable to adversarial examples: images with imperceptible perturbations crafted to fool classifiers. However, the interpretability of these perturbations remains under-explored in the literature. This work aims to better understand the roles of adversarial perturbations and to provide visual explanations from the pixel, image, and network perspectives. We show that adversarial perturbations exert a promotion-suppression effect (PSE) on neurons' activations and can be primarily categorized into three types: i) suppression-dominated perturbations, which mainly reduce the classification score of the true label; ii) promotion-dominated perturbations, which focus on boosting the confidence of the target label; and iii) balanced perturbations, which play a dual role in suppression and promotion. We also provide image-level interpretability of adversarial examples, linking the PSE of pixel-level perturbations to the class-specific discriminative image regions localized by class activation mapping (Zhou et al. 2016). Further, we examine the adversarial effect through network dissection (Bau et al. 2017), which offers concept-level interpretability of hidden units, and show that there is a tight connection between a unit's sensitivity to adversarial attacks and its interpretability on semantic concepts. Lastly, we draw on our interpretation to offer new insights for improving the adversarial robustness of networks.
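To make the promotion-suppression effect concrete, the sketch below (an illustration, not the paper's released code) crafts a targeted adversarial example with a simple PGD attack and measures how the perturbation shifts the classifier's logits: the drop of the true-label score (suppression) versus the gain of the target-label score (promotion). The pretrained ResNet-50, the pgd_attack helper, and the ratio threshold used to decide which effect dominates are all assumptions for illustration; the paper's precise criterion for separating the three perturbation types may differ.

    # Illustrative sketch (not the authors' code): measure the promotion-suppression
    # effect (PSE) of a targeted adversarial perturbation on a classifier's logits.
    # Assumes a pretrained torchvision ResNet-50, an input tensor x of shape
    # (1, 3, 224, 224) with values in [0, 1] (normalization folded into the model),
    # and hypothetical label indices y_true / y_target as shape-(1,) LongTensors.
    import torch
    import torch.nn.functional as F
    import torchvision.models as models

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

    def pgd_attack(x, y_target, eps=8 / 255, alpha=2 / 255, steps=10):
        """Targeted L-infinity PGD: push the prediction toward y_target."""
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y_target)
            (grad,) = torch.autograd.grad(loss, x_adv)
            # Gradient descent on the targeted loss, then project back into the
            # eps-ball around the clean image and the valid pixel range.
            x_adv = x_adv.detach() - alpha * grad.sign()
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
        return x_adv.detach()

    @torch.no_grad()
    def promotion_suppression(x, x_adv, y_true, y_target, ratio=2.0):
        """Categorize the perturbation by how it shifts true vs. target logits."""
        z_clean, z_adv = model(x), model(x_adv)
        suppression = (z_clean[0, y_true] - z_adv[0, y_true]).item()    # true-label score drop
        promotion = (z_adv[0, y_target] - z_clean[0, y_target]).item()  # target-label score gain
        if suppression > ratio * promotion:
            kind = "suppression-dominated"
        elif promotion > ratio * suppression:
            kind = "promotion-dominated"
        else:
            kind = "balanced"
        return suppression, promotion, kind

    # Example usage (y_true / y_target are hypothetical ImageNet class indices):
    # x_adv = pgd_attack(x, y_target)
    # print(promotion_suppression(x, x_adv, y_true, y_target))

The same bookkeeping extends from logits to intermediate activations: comparing a hidden unit's response on x and x_adv indicates whether the perturbation promotes or suppresses that unit, which is the quantity the image- and network-level analyses build on.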

[1] M. Yuan et al. Model selection and estimation in regression with grouped variables. 2006.

[2] Joan Bruna et al. Intriguing properties of neural networks. ICLR, 2014.

[3] Yoav Goldberg et al. LaVAN: Localized and Visible Adversarial Noise. ICML, 2018.

[4] Micah Sherr et al. Hidden Voice Commands. USENIX Security Symposium, 2016.

[5] Hao Cheng et al. Adversarial Robustness vs. Model Compression, or Both? ICCV, 2019.

[6] Martín Abadi et al. Adversarial Patch. arXiv preprint, 2017.

[7] Jinfeng Yi et al. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI, 2018.

[8] Stephen P. Boyd et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 2011.

[9] Rob Fergus et al. Visualizing and Understanding Convolutional Networks. ECCV, 2014.

[10] Ting Wang et al. Interpretable Deep Learning under Fire. USENIX Security Symposium, 2020.

[11] Bolei Zhou et al. Network Dissection: Quantifying Interpretability of Deep Visual Representations. CVPR, 2017.

[12] Ananthram Swami et al. The Limitations of Deep Learning in Adversarial Settings. IEEE European Symposium on Security and Privacy (EuroS&P), 2016.

[13] David A. Wagner et al. Towards Evaluating the Robustness of Neural Networks. IEEE Symposium on Security and Privacy (SP), 2017.

[14] Vineeth N. Balasubramanian et al. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. WACV, 2018.

[15] John C. Duchi et al. Certifying Some Distributional Robustness with Principled Adversarial Training. ICLR, 2018.

[16] Bolei Zhou et al. Visualizing and Understanding Generative Adversarial Networks (Extended Abstract). arXiv preprint, 2019.

[17] Alex Krizhevsky et al. Learning Multiple Layers of Features from Tiny Images. 2009.

[18] Samy Bengio et al. Adversarial examples in the physical world. ICLR, 2017.

[19] Xiang Chen et al. ASP: A Fast Adversarial Attack Example Generation Framework based on Adversarial Saliency Prediction. arXiv preprint, 2018.

[20] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[21] Samy Bengio et al. Adversarial Machine Learning at Scale. ICLR, 2017.

[22] David Wagner et al. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. AISec@CCS, 2017.

[23] Hang Su et al. Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples. arXiv preprint, 2017.

[24] Sergey Ioffe et al. Rethinking the Inception Architecture for Computer Vision. CVPR, 2016.

[25] David A. Wagner et al. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML, 2018.

[26] Kate Saenko et al. RISE: Randomized Input Sampling for Explanation of Black-box Models. BMVC, 2018.

[27] Deniz Erdogmus et al. Structured Adversarial Attack: Towards General Implementation and Better Interpretability. ICLR, 2019.

[28] Michael I. Jordan et al. ML-LOO: Detecting Adversarial Examples with Feature Attribution. AAAI, 2020.

[29] Mingyan Liu et al. Spatially Transformed Adversarial Examples. ICLR, 2018.

[30] Liwei Wang et al. RANDOM MASK: Towards Robust Convolutional Neural Networks. arXiv preprint, 2020.

[31] Aleksander Madry et al. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR, 2018.

[32] Jonathon Shlens et al. Explaining and Harnessing Adversarial Examples. ICLR, 2015.

[33] Adam M. Oberman et al. Improved robustness to adversarial examples using Lipschitz regularization of the loss. arXiv preprint, 2018.

[34] Ananthram Swami et al. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. IEEE Symposium on Security and Privacy (SP), 2016.

[35] Sijia Liu et al. Topology Attack and Defense for Graph Neural Networks: An Optimization Perspective. IJCAI, 2019.

[36] Jason Yosinski et al. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CVPR, 2015.

[37] Bolei Zhou et al. Learning Deep Features for Discriminative Localization. CVPR, 2016.