On the Connections between Counterfactual Explanations and Adversarial Examples

Counterfactual explanations and adversarial examples have emerged as critical research areas for addressing the explainability and robustness goals of machine learning (ML). While counterfactual explanations were developed to provide recourse to individuals adversely impacted by algorithmic decisions, adversarial examples were designed to expose the vulnerabilities of ML models. Although prior research has hinted at commonalities between the two frameworks, there has been little work on systematically exploring the connections between the literature on counterfactual explanations and that on adversarial examples. In this work, we make one of the first attempts at formalizing these connections. More specifically, we theoretically analyze salient counterfactual explanation and adversarial example generation methods and highlight the conditions under which they behave similarly. Our analysis demonstrates that several popular pairs of methods are equivalent: the counterfactual explanations of Wachter et al. and the Carlini-Wagner attack (with mean squared error loss), and C-CHVAE and the natural adversarial examples of Zhao et al. We also bound the distance between the counterfactual explanations and adversarial examples generated by the Wachter et al. and DeepFool methods for linear models. Finally, we empirically validate our theoretical findings through extensive experiments on synthetic and real-world datasets.
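
To make the analyzed objectives concrete, the sketch below implements the gradient-based counterfactual search of Wachter et al. in PyTorch. It is a minimal sketch, not the authors' reference code: the function name, the hyperparameter defaults, and the choice of an l1 proximity term are illustrative assumptions. Writing the counterfactual as x_cf = x + delta and swapping the proximity term for an l2 penalty turns the same objective into the Carlini-Wagner attack with mean squared error loss, which is the equivalence referred to above.

```python
import torch

def wachter_counterfactual(model, x, target, lam=1.0, lr=0.01, steps=500):
    """Search for a counterfactual x_cf near x that the model maps to `target`.

    Minimizes the objective of Wachter et al.:
        lam * (f(x_cf) - target)^2 + ||x_cf - x||_1
    With x_cf = x + delta and an l2 proximity term, the same objective is the
    Carlini-Wagner attack objective under a mean squared error loss.
    """
    x_cf = x.clone().detach().requires_grad_(True)  # leaf tensor to optimize
    opt = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        prediction_loss = (model(x_cf) - target).pow(2).sum()  # push f(x_cf) toward target
        proximity_loss = (x_cf - x).abs().sum()                # keep x_cf close to x
        loss = lam * prediction_loss + proximity_loss
        loss.backward()
        opt.step()
    return x_cf.detach()
```

Wachter et al. additionally increase lam until the prediction constraint is satisfied; the fixed lam above keeps the sketch short.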

[1] Vincent Ballet et al. Imperceptible Adversarial Attacks on Tabular Data, 2019, arXiv.

[2] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization, 2015, ICLR.

[3] Mukund Sundararajan et al. Axiomatic Attribution for Deep Networks, 2017, ICML.

[4] Shumeet Baluja and Ian S. Fischer. Adversarial Transformation Networks: Learning to Generate Adversarial Examples, 2017, arXiv.

[5] Moustapha Cissé et al. Houdini: Fooling Deep Structured Prediction Models, 2017, arXiv.

[6] Timo Freiesleben. Counterfactual Explanations & Adversarial Examples - Common Grounds, Essential Differences, and Potential Transfers, 2020, arXiv.

[7] Ali Shafahi et al. Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks, 2018, NeurIPS.

[8] Seyed-Mohsen Moosavi-Dezfooli et al. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Amir-Hossein Karimi et al. Algorithmic recourse under imperfect causal knowledge: a probabilistic approach, 2020, NeurIPS.

[10] Suresh Venkatasubramanian and Mark Alfano. The philosophical basis of algorithmic recourse, 2020, FAT*.

[11] Berk Ustun et al. Actionable Recourse in Linear Classification, 2019, FAT*.

[12] Nicholas Carlini and David A. Wagner. Towards Evaluating the Robustness of Neural Networks, 2017, IEEE Symposium on Security and Privacy (SP).

[13] Jiawei Su et al. One Pixel Attack for Fooling Deep Neural Networks, 2017, IEEE Transactions on Evolutionary Computation.

[14] Kieran Browne and Ben Swift. Semantics and explanation: why counterfactual explanations produce adversarial examples in deep neural networks, 2020, arXiv.

[15] Zhengli Zhao et al. Generating Natural Adversarial Examples, 2018, ICLR.

[16] Amir-Hossein Karimi et al. Model-Agnostic Counterfactual Explanations for Consequential Decisions, 2020, AISTATS.

[17] Shalmali Joshi et al. Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems, 2019, arXiv.

[18] Nina Narodytska et al. Simple Black-Box Adversarial Perturbations for Deep Networks, 2016, arXiv.

[19] Amir-Hossein Karimi et al. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects, 2020, arXiv.

[20] Karen Simonyan et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, 2014, ICLR.

[21] Francesco Croce and Matthias Hein. Sparse and Imperceivable Adversarial Attacks, 2019, IEEE/CVF International Conference on Computer Vision (ICCV).

[22] Ashish Bora et al. Compressed Sensing using Generative Models, 2017, ICML.

[23] Sandra Wachter et al. Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR, 2017, arXiv.

[24] Seyed-Mohsen Moosavi-Dezfooli et al. Universal Adversarial Perturbations, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Ian J. Goodfellow et al. Explaining and Harnessing Adversarial Examples, 2015, ICLR.

[26] Arnaud Van Looveren and Janis Klaise. Interpretable Counterfactual Explanations Guided by Prototypes, 2021, ECML/PKDD.

[27] Alexey Kurakin et al. Adversarial examples in the physical world, 2017, ICLR.

[28] Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions, 2017, NIPS.

[29] Marco Tulio Ribeiro et al. "Why Should I Trust You?": Explaining the Predictions of Any Classifier, 2016, arXiv.

[30] Cynthia Rudin et al. Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges, 2021, arXiv.

[31] Battista Biggio et al. Poisoning Attacks against Support Vector Machines, 2012, ICML.

[32] Aleksander Madry et al. Towards Deep Learning Models Resistant to Adversarial Attacks, 2018, ICLR.

[33] Martin Pawelczyk et al. Learning Model-Agnostic Counterfactual Explanations for Tabular Data, 2020, WWW.

[34] Francesco Cartella et al. Adversarial Attacks for Tabular Data: Application to Fraud Detection and Imbalanced Data, 2021, SafeAI@AAAI.

[35] Solon Barocas et al. The hidden assumptions behind counterfactual explanations and principal reasons, 2020, FAT*.

[36] Kaivalya Rawal and Himabindu Lakkaraju. Interpretable and Interactive Summaries of Actionable Recourses, 2020, arXiv.

[37] Christian Szegedy et al. Intriguing properties of neural networks, 2014, ICLR.

[38] Amir-Hossein Karimi et al. Algorithmic Recourse: from Counterfactual Explanations to Interventions, 2021, FAccT.

[39] Vahid Behzadan et al. Adversarial Poisoning Attacks and Defense for General Multi-Class Models Based On Synthetic Reduced Nearest Neighbors, 2021, arXiv.

[40] Sahil Verma et al. Counterfactual Explanations for Machine Learning: A Review, 2020, arXiv.