Adversarial Attacks and Defenses

Despite recent advances across a wide spectrum of applications, machine learning models, especially deep neural networks, have been shown to be vulnerable to adversarial attacks. Attackers add carefully crafted perturbations to the input that are almost imperceptible to humans, yet can cause models to make wrong predictions. Techniques that protect models against adversarial input are called adversarial defense methods. Although many approaches have been proposed to study adversarial attacks and defenses in different scenarios, an intriguing and crucial challenge remains: how can we truly understand model vulnerability? Inspired by the saying that "if you know yourself and your enemy, you need not fear the battles," we may tackle this challenge by interpreting machine learning models to open the black boxes. The goal of model interpretation, or interpretable machine learning, is to explain the working mechanisms of models in human-understandable terms. Recently, some approaches have begun incorporating interpretation into the exploration of adversarial attacks and defenses. Meanwhile, we also observe that many existing methods for adversarial attacks and defenses, although not explicitly claimed, can be understood from the perspective of interpretation. In this paper, we review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation. We categorize interpretation into two types: feature-level interpretation and model-level interpretation. For each type, we elaborate on how it can be used for adversarial attacks and defenses. We then briefly illustrate additional correlations between interpretation and adversaries. Finally, we discuss the challenges and future directions for tackling adversarial issues with interpretation.
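
To make these concepts concrete, the minimal sketch below (not part of the original survey) shows how the same gradient computation underlies both a feature-level interpretation (a saliency map over input features) and a simple adversarial attack, the Fast Gradient Sign Method (FGSM). It assumes PyTorch; the toy model, random input, and epsilon value are hypothetical placeholders chosen only for illustration.

```python
# A minimal sketch, assuming PyTorch: the gradient of the loss w.r.t. the input
# serves both as a feature-level interpretation (saliency) and as the direction
# for an FGSM adversarial perturbation. Model, input, and epsilon are toy
# placeholders, not components of any specific method from the survey.
import torch
import torch.nn as nn

def saliency_and_fgsm(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                      epsilon: float = 0.03):
    """Return a gradient saliency map and an FGSM-perturbed copy of x."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()

    grad = x.grad.detach()
    # Feature-level interpretation: which input features the loss is most
    # sensitive to (a simple gradient saliency map).
    saliency = grad.abs()
    # FGSM attack: take a small step in the direction that increases the loss,
    # producing a perturbation that is nearly imperceptible for small epsilon.
    x_adv = (x + epsilon * grad.sign()).detach().clamp(0.0, 1.0)
    return saliency, x_adv

if __name__ == "__main__":
    # Toy model and a random "image" purely for demonstration.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(1, 3, 32, 32)
    y = torch.tensor([3])
    saliency, x_adv = saliency_and_fgsm(model, x, y)
    print(saliency.shape, (x_adv - x).abs().max().item())
```

The shared reliance on input gradients is one illustration of why interpretation and adversarial analysis are so closely related: the quantity that explains a prediction also reveals how to change it.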
