Enhancing Adversarial Robustness for Image Classification by Regularizing Class-Level Feature Distribution

Recent research has shown that deep neural networks (DNNs) are vulnerable to adversarial examples. Adversarial training is, in practice, the most effective approach for improving the robustness of DNNs against adversarial examples. However, conventional adversarial training methods focus only on the classification results or on instance-level relationships among the feature representations of adversarial examples. Inspired by the observation that adversarial examples break the distinguishability of the feature representations DNNs learn for different classes, we propose Intra- and Inter-Class Feature Regularization ($\mathrm{I}^{2}$FR) to make the feature distribution of adversarial examples retain the same classification properties as that of clean examples. On the one hand, intra-class regularization restricts the feature distance between an adversarial example and both its corresponding clean example and other samples of the same class. On the other hand, inter-class regularization prevents the features of adversarial examples from drifting toward other classes. By adding $\mathrm{I}^{2}$FR to both the adversarial example generation and the model training steps of adversarial training, we obtain stronger and more diverse adversarial examples, and the network learns a more distinguishable and reasonable feature distribution. Experiments on various adversarial training frameworks demonstrate that $\mathrm{I}^{2}$FR adapts to multiple training frameworks and outperforms state-of-the-art methods in classifying both clean data and adversarial examples.
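To make the idea concrete, the sketch below shows one way such a regularizer could be implemented. It is a minimal PyTorch illustration reconstructed from this abstract alone, not the authors' implementation: the names and hyperparameters (i2fr_loss, margin, lam, the model.features/model.classifier split, and the PGD settings) are assumptions, and the paper's exact distance measure and weighting may differ.

```python
# Minimal sketch of intra-/inter-class feature regularization (I^2FR),
# reconstructed from the abstract; NOT the authors' released code.
import torch
import torch.nn.functional as F

def i2fr_loss(feat_adv, feat_clean, labels, margin=1.0):
    """feat_adv, feat_clean: (B, D) features; labels: (B,) class indices."""
    dist = torch.cdist(feat_adv, feat_clean)           # (B, B) pairwise L2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Intra-class term: pull each adversarial feature toward its own clean
    # example and toward clean examples of the same class.
    intra = dist.diagonal().mean() \
          + (dist * (same & ~eye)).sum() / (same & ~eye).sum().clamp(min=1)

    # Inter-class term: push adversarial features away from clean features
    # of other classes, up to a hinge margin.
    inter = F.relu(margin - dist[~same]).mean()
    return intra + inter

def pgd_with_i2fr(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """PGD whose inner objective also maximizes the I^2FR term, so the attack
    targets the feature distribution as well as the classification loss.
    Assumes the model exposes `features` (penultimate) and `classifier`."""
    with torch.no_grad():
        feat_clean = model.features(x)
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        feat_adv = model.features(x_adv)
        loss = F.cross_entropy(model.classifier(feat_adv), y) \
             + i2fr_loss(feat_adv, feat_clean, y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv

def train_step(model, x, y, optimizer, lam=1.0):
    """One adversarial training step: cross-entropy on adversarial examples
    plus the same regularizer, with a hypothetical weight `lam`."""
    x_adv = pgd_with_i2fr(model, x, y)
    feat_clean, feat_adv = model.features(x), model.features(x_adv)
    loss = F.cross_entropy(model.classifier(feat_adv), y) \
         + lam * i2fr_loss(feat_adv, feat_clean, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note the sign symmetry: the attack ascends the same objective the training step descends, which is one plausible reading of "adding $\mathrm{I}^{2}$FR in both adversarial example generation and model training steps". In a full setup, `lam`, the margin, and which layer supplies the features would all need tuning; the abstract does not specify them.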
