Improving Adversarial Robustness by Data-Specific Discretization

A recent line of research has proposed (either implicitly or explicitly) gradient-masking preprocessing techniques to improve adversarial robustness. However, as shown by Athalye-Carlini-Wagner, essentially all of these defenses can be circumvented if an attacker leverages approximate gradient information with respect to the preprocessing. This raises the natural question of whether there is any useful preprocessing technique in the context of white-box attacks, even for only mildly complex datasets such as MNIST. In this paper we provide an affirmative answer to this question. Our key observation is that for several popular datasets, one can approximately encode the entire dataset using a small set of separable codewords derived from the training set, while retaining high accuracy on natural images. The separability of the codewords in turn prevents the small perturbations used in ℓ∞ attacks from changing the feature encoding, leading to adversarial robustness. For example, for MNIST our code consists of only two codewords, 0 and 1, and the encoding of any pixel x is simply 1[x > 0.5] (i.e., whether the pixel exceeds 0.5). Applying this code to a naturally trained model already gives high adversarial robustness, even under strong white-box attacks based on the Backward Pass Differentiable Approximation (BPDA) method of Athalye-Carlini-Wagner that take the codes into account. We give density-estimation-based algorithms to construct such codes, and provide a theoretical analysis and certificates of when our method can be effective. Systematic evaluation demonstrates that our method improves adversarial robustness on MNIST, CIFAR-10, and ImageNet, for both naturally and adversarially trained models.
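To make the MNIST example concrete, the sketch below applies the two-codeword discretization 1[x > 0.5] as a preprocessing step. It is a minimal illustration assuming pixel intensities already scaled to [0, 1]; the function name, threshold argument, and model call are illustrative choices, not taken from the paper's implementation.

```python
import numpy as np

def binarize(images: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Two-codeword MNIST discretization: map each pixel x to 1[x > threshold].

    Assumes intensities are scaled to [0, 1]. The threshold 0.5 matches the
    value quoted in the abstract; everything else is an illustrative choice.
    """
    return (images > threshold).astype(np.float32)

# Usage sketch: discretize inputs before the (naturally or adversarially trained) model.
# The mapping is piecewise constant, so its gradient is zero almost everywhere and a
# white-box attacker must fall back on an approximation such as BPDA.
#
#   batch = binarize(batch)   # shape preserved; values now in {0.0, 1.0}
#   logits = model(batch)     # hypothetical classifier call
```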

[1] Fan Zhang, et al. Stealing Machine Learning Models via Prediction APIs, 2016, USENIX Security Symposium.

[2] Yanjun Qi, et al. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks, 2017, NDSS.

[3] Aditi Raghunathan, et al. Certified Defenses against Adversarial Examples, 2018, ICLR.

[4] Mykel J. Kochenderfer, et al. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks, 2017, CAV.

[5] Colin Raffel, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, 2018, ICLR.

[6] Sergey Ioffe, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016, AAAI.

[7] Yang Song, et al. PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples, 2017, ICLR.

[8] David A. Wagner, et al. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples, 2018, ICML.

[9] Dan Boneh, et al. Ensemble Adversarial Training: Attacks and Defenses, 2017, ICLR.

[10] Rama Chellappa, et al. Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models, 2018, ICLR.

[11] Alan L. Yuille, et al. Mitigating adversarial effects through randomization, 2017, ICLR.

[12] Aleksander Madry, et al. Towards Deep Learning Models Resistant to Adversarial Attacks, 2017, ICLR.

[13] Aleksander Madry, et al. Adversarially Robust Generalization Requires More Data, 2018, NeurIPS.

[14] J. Zico Kolter, et al. Provable defenses against adversarial examples via the convex outer adversarial polytope, 2017, ICML.

[15] Hao Chen, et al. MagNet: A Two-Pronged Defense against Adversarial Examples, 2017, CCS.

[16] Zoubin Ghahramani, et al. A study of the effect of JPG compression on adversarial images, 2016, ArXiv.

[17] Moustapha Cissé, et al. Countering Adversarial Images using Input Transformations, 2018, ICLR.