Learning to Disentangle Robust and Vulnerable Features for Adversarial Detection

Although deep neural networks have shown promising performance on various tasks, even reaching human-level performance on some, they can be fooled into making incorrect predictions by imperceptibly small perturbations of an input. A large body of prior work has proposed defenses against such adversarial attacks, either through robust inference or by detecting adversarial inputs. Yet most of these defenses are ineffective against whitebox attacks, in which the adversary has knowledge of both the model and the defense. More importantly, they do not offer a convincing explanation of why the generated adversarial inputs successfully fool the target models. To address these shortcomings of existing approaches, we hypothesize that adversarial inputs are tied to latent features that are susceptible to adversarial perturbation, which we call vulnerable features. Based on this intuition, we propose a minimax game formulation that disentangles the latent features of each instance into robust and vulnerable ones, using a variational autoencoder with two latent spaces. We thoroughly validate our model against both blackbox and whitebox attacks on the MNIST, Fashion MNIST, and Cat & Dog datasets, and the results show that adversarial inputs cannot bypass our detector without changing their semantics, in which case the attack has failed.
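
The following is a minimal, hypothetical PyTorch sketch of the idea described above, not the authors' implementation: the encoder architecture, layer sizes, loss weights, and the exact form of the minimax objective are assumptions made for illustration. It shows a VAE-style encoder that splits each input into a "robust" code z_r and a "vulnerable" code z_v, with a classifier on z_r trained to stay correct under perturbation while the vulnerable branch is left free to absorb the adversarially sensitive directions.

```python
# Illustrative sketch only: a VAE with two latent spaces (robust z_r, vulnerable z_v).
# All hyperparameters and the precise loss composition are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentanglingVAE(nn.Module):
    def __init__(self, in_dim=784, hid=256, z_dim=32, n_classes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        # Two latent heads: robust (z_r) and vulnerable (z_v), each with mean and log-variance.
        self.mu_r, self.lv_r = nn.Linear(hid, z_dim), nn.Linear(hid, z_dim)
        self.mu_v, self.lv_v = nn.Linear(hid, z_dim), nn.Linear(hid, z_dim)
        # Decoder reconstructs the input from the concatenation of both codes.
        self.dec = nn.Sequential(nn.Linear(2 * z_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, in_dim))
        # Separate classifiers on each latent space.
        self.cls_r = nn.Linear(z_dim, n_classes)  # should remain correct under attack
        self.cls_v = nn.Linear(z_dim, n_classes)  # allowed to be fooled by perturbations

    @staticmethod
    def reparam(mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x):
        h = self.enc(x)
        mu_r, lv_r = self.mu_r(h), self.lv_r(h)
        mu_v, lv_v = self.mu_v(h), self.lv_v(h)
        z_r, z_v = self.reparam(mu_r, lv_r), self.reparam(mu_v, lv_v)
        x_hat = self.dec(torch.cat([z_r, z_v], dim=-1))
        return x_hat, z_r, z_v, (mu_r, lv_r), (mu_v, lv_v)


def kl_std_normal(mu, logvar):
    # KL divergence between N(mu, exp(logvar)) and the standard normal prior.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()


def training_loss(model, x, x_adv, y, beta=1.0):
    """One illustrative minimax-style step: the robust classifier is trained on both
    clean and adversarially perturbed inputs, while the vulnerable classifier is fit
    only on clean inputs, pushing adversarial perturbations to act through z_v."""
    x_hat, z_r, z_v, (mu_r, lv_r), (mu_v, lv_v) = model(x)
    recon = F.mse_loss(x_hat, x)
    kl = kl_std_normal(mu_r, lv_r) + kl_std_normal(mu_v, lv_v)

    # Robust branch: correct on clean AND adversarial inputs (outer minimization).
    _, z_r_adv, _, _, _ = model(x_adv)
    loss_r = (F.cross_entropy(model.cls_r(z_r), y)
              + F.cross_entropy(model.cls_r(z_r_adv), y))

    # Vulnerable branch: fit clean labels only; it is expected to flip under attack.
    loss_v = F.cross_entropy(model.cls_v(z_v), y)

    return recon + beta * kl + loss_r + loss_v
```

In this sketch, x_adv would be generated on the fly by a standard attack such as PGD against the robust classifier, and at test time a disagreement between the predictions from z_r and z_v (or an unusually large shift in z_v) could be thresholded as the adversarial detection signal. This is one plausible reading of the disentanglement objective described in the abstract, not a reproduction of the paper's exact training procedure.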
