No.

I. ATTACKING "ATTACKS MEET INTERPRETABILITY"

AmI (Attacks meet Interpretability) is an "attribute-steered" defense [1] to detect [3] adversarial examples [2] on face-recognition models. By applying interpretability techniques to a pre-trained neural network, AmI identifies "important" neurons. It then creates a second, augmented neural network with the same parameters but with the activations of the important neurons increased. AmI rejects inputs on which the original and augmented neural networks disagree. We find that this defense (presented at NeurIPS 2018 as a spotlight paper, in the top 3% of submissions) is completely ineffective, and even defense-oblivious attacks reduce the detection rate to 0% on untargeted attacks. That is, AmI is no more robust to untargeted attacks than the undefended original network. Figure 1 shows selected adversarial examples that fool the AmI defense. We are incredibly grateful to the authors for releasing their source code, which we build on. We hope that future work will continue to release source code by publication time to accelerate progress in this field.
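To make the rejection rule concrete, the sketch below shows the disagreement test and a defense-oblivious untargeted attack that only queries the undefended model. It assumes two PyTorch models, original_model and augmented_model (the copy with strengthened "important"-neuron activations); the function names and PGD hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch, not the authors' code: AmI-style disagreement detector
# plus a defense-oblivious untargeted PGD attack on the original network.
import torch
import torch.nn.functional as F

def ami_reject(original_model, augmented_model, x):
    """Return a boolean tensor: True where the input is flagged as adversarial.

    AmI rejects an input when the original network and the augmented network
    (same weights, but with "important" neuron activations strengthened)
    predict different labels.
    """
    with torch.no_grad():
        pred_orig = original_model(x).argmax(dim=1)
        pred_aug = augmented_model(x).argmax(dim=1)
    return pred_orig != pred_aug

def untargeted_pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=40):
    """Defense-oblivious untargeted PGD: attacks only the original model and
    never queries the augmented network or the detector."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project onto L_inf ball
            x_adv = x_adv.clamp(0, 1)                 # keep a valid image
        x_adv = x_adv.detach()
    return x_adv
```

Under this reading of the defense, an attack of this kind succeeds whenever the adversarial example changes the original model's label while the augmented model changes its label in the same way, so the disagreement test never fires.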
[1] Guanhong Tao, et al. Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples. NeurIPS, 2018.
[2] Christian Szegedy, et al. Intriguing properties of neural networks. ICLR, 2014.
[3] Nicholas Carlini and David Wagner. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. AISec@CCS, 2017.