Counterfactual Explanations Can Be Manipulated

Counterfactual explanations are emerging as an attractive option for providing recourse to individuals adversely impacted by algorithmic decisions. As they are deployed in critical applications (e.g., law enforcement, financial lending), it becomes important to understand the vulnerabilities of these methods and find ways to address them; yet these vulnerabilities and shortcomings remain largely unexplored. In this work, we introduce the first framework that describes the vulnerabilities of counterfactual explanations and shows how they can be manipulated. Specifically, we show that counterfactual explanations may converge to drastically different counterfactuals under a small perturbation, indicating that they are not robust. Leveraging this insight, we introduce a novel training objective that produces seemingly fair models for which counterfactual explanation methods find much lower-cost recourse under a slight perturbation. We describe how such models can unfairly provide low-cost recourse to specific subgroups in the data while appearing fair to auditors. We perform experiments on loan and violent crime prediction data sets in which certain subgroups achieve up to 20x lower-cost recourse under the perturbation. These results raise concerns about the dependability of current counterfactual explanation techniques, which we hope will inspire work on robust counterfactual explanations. A sketch of the instability the work exploits follows.
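To make the instability concrete, here is a minimal sketch (not the authors' code) of a Wachter-style gradient-based counterfactual search: minimize a squared prediction loss plus an L1 recourse cost by gradient descent on the input. The two-layer network, its random weights, the target score, and all hyperparameters are illustrative assumptions. Because the objective is nonconvex for such a model, two nearly identical queries can descend into different basins and return drastically different counterfactuals with very different recourse costs, which is exactly the behavior a manipulated model can amplify. Whether the two runs below actually diverge depends on the random weights; the point is that nothing prevents it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network f(x) = sigmoid(w2 . tanh(W1 x + b1) + b2);
# weights are random and purely for illustration.
d, h = 5, 16
W1 = rng.normal(size=(h, d))
b1 = rng.normal(size=h)
w2 = rng.normal(size=h)
b2 = 0.0

def forward(x):
    hidden = np.tanh(W1 @ x + b1)
    p = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))
    return p, hidden

def input_grad(x):
    """Gradient dp/dx via manual backprop through the two-layer network."""
    p, hidden = forward(x)
    return p * (1 - p) * (W1.T @ ((1 - hidden**2) * w2)), p

def find_counterfactual(x, target=0.9, lam=0.05, lr=0.1, steps=3000):
    """Wachter-style search: minimize (f(xcf) - target)^2 + lam * ||xcf - x||_1."""
    xcf = x.copy()
    for _ in range(steps):
        g, p = input_grad(xcf)
        grad = 2 * (p - target) * g + lam * np.sign(xcf - x)
        xcf -= lr * grad
    return xcf

x = rng.normal(size=d)
delta = 1e-2 * rng.normal(size=d)  # small perturbation to the query point

cf1 = find_counterfactual(x)
cf2 = find_counterfactual(x + delta)

# If the two searches land in different basins, these numbers differ sharply.
print("distance between counterfactuals:", np.linalg.norm(cf1 - cf2))
print("recourse cost, unperturbed query:", np.abs(cf1 - x).sum())
print("recourse cost, perturbed query:  ", np.abs(cf2 - (x + delta)).sum())
```

In the paper's threat model, a model trainer who controls the weights can deliberately shape the loss surface so that the perturbed query (standing in for a favored subgroup) reliably finds the cheap counterfactual while an auditor's unperturbed query finds the expensive one.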
