Model extraction from counterfactual explanations

Post-hoc explanation techniques are a posteriori methods used to explain how black-box machine learning models produce their outcomes. Among them, counterfactual explanations have become one of the most popular: in addition to highlighting the features most important to the black-box model, they provide users with actionable explanations in the form of data instances that would have received a different outcome. Nonetheless, in doing so, they also leak non-trivial information about the model itself, which raises privacy issues. In this work, we show how an adversary can leverage the information provided by counterfactual explanations to mount high-fidelity and high-accuracy model extraction attacks. More precisely, our attack enables the adversary to build a faithful copy of a target model simply by accessing its counterfactual explanations. An empirical evaluation of the proposed attack on black-box models trained on real-world datasets demonstrates that it achieves high-fidelity and high-accuracy extraction even under low query budgets.
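
To make the attack idea concrete, the following is a minimal sketch (not the paper's exact protocol) of how an adversary could exploit counterfactuals: each query yields a labeled point, and each returned counterfactual yields a second point labeled with the opposite class, since it lies on the other side of the decision boundary by construction. The surrogate is then trained on both. The API `query_target`, the uniform query distribution, the binary 0/1 labels, and the MLP surrogate are all illustrative assumptions.

```python
# Hedged sketch of counterfactual-based model extraction, assuming a binary
# black-box classifier exposed through a hypothetical `query_target(x)` call
# that returns (predicted_label, counterfactual_instance).
import numpy as np
from sklearn.neural_network import MLPClassifier


def extract_model(query_target, n_features, n_queries=100, seed=None):
    """Build a surrogate of the target model from its counterfactual explanations."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_queries):
        # Draw a random query point from the (assumed known) feature domain.
        x = rng.uniform(0.0, 1.0, size=n_features)
        label, counterfactual = query_target(x)
        # The query point keeps the label predicted by the target model...
        X.append(x)
        y.append(label)
        # ...while the counterfactual is labeled with the opposite class,
        # since by definition it receives a different outcome from the target.
        X.append(counterfactual)
        y.append(1 - label)
    surrogate = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000)
    surrogate.fit(np.array(X), np.array(y))
    return surrogate


def fidelity(surrogate, target_predict, X_eval):
    """Fraction of evaluation points on which the surrogate agrees with the target."""
    return np.mean(surrogate.predict(X_eval) == target_predict(X_eval))
```

Under this sketch, fidelity (agreement with the target model) rather than accuracy on ground-truth labels is the natural success metric, since the adversary's goal is a faithful copy of the target's decision boundary.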
