MEGEX: Data-Free Model Extraction Attack against Gradient-Based Explainable AI

The advance of explainable artificial intelligence (XAI), which provides reasons for its predictions, is expected to accelerate the real-world use of deep neural networks, for example in Machine Learning as a Service (MLaaS), where a trained model returns predictions on queried data. Deep neural networks deployed in MLaaS face the threat of model extraction attacks. A model extraction attack violates intellectual property and privacy: an adversary steals a trained model hosted in the cloud using only its predictions. In particular, the recently proposed data-free model extraction attack is even more critical, since the adversary trains a generative model to produce queries instead of preparing input data. Its feasibility, however, still needs to be studied because it requires far more queries than attacks that use surrogate datasets. In this paper, we propose MEGEX, a data-free model extraction attack against gradient-based explainable AI. In this attack, the adversary exploits the returned explanations to train the generative model, reducing the number of queries needed to steal the model. Our experiments show that the proposed method reconstructs high-accuracy models, reaching 0.97× and 0.98× the victim model's accuracy on the SVHN and CIFAR-10 datasets given 2M and 20M queries, respectively. This implies a trade-off between the interpretability of models and the difficulty of stealing them.
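The abstract describes the attack only at a high level. The sketch below illustrates, under stated assumptions, how a returned input-gradient explanation could stand in for the white-box gradient that a data-free extraction loop otherwise lacks. It is a minimal, hypothetical illustration, not the paper's reference implementation: the endpoint `victim_api`, the toy networks, and in particular the assumption that the explanation contains the input gradient of every class score are simplifications introduced here (deployed explanation APIs typically return only the predicted-class saliency map).

```python
# Hypothetical sketch: data-free model extraction where the victim's
# gradient-based explanation supplies the generator's training signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

NZ, N_CLASSES, BATCH, STEPS = 100, 10, 32, 1000

# --- toy stand-ins for the victim, the stolen copy, and the generator ---
def small_cnn():
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(16 * 16 * 16, N_CLASSES))

victim = small_cnn().eval()      # "server-side" model, treated as a black box
student = small_cnn()            # the adversary's copy
generator = nn.Sequential(       # maps noise z to 3x32x32 query images
    nn.Linear(NZ, 3 * 32 * 32), nn.Tanh(), nn.Unflatten(1, (3, 32, 32)))

def victim_api(x):
    """Simulated MLaaS endpoint: returns softmax probabilities and, as the
    'explanation', the input gradient of every class score (a simplifying
    assumption made for this sketch)."""
    with torch.enable_grad():
        x = x.detach().requires_grad_(True)
        probs = F.softmax(victim(x), dim=1)
        jac = torch.stack([
            torch.autograd.grad(probs[:, c].sum(), x, retain_graph=True)[0]
            for c in range(N_CLASSES)], dim=1)          # [B, C, 3, 32, 32]
    return probs.detach(), jac

class VictimWithExplanation(torch.autograd.Function):
    """Wraps the black-box victim so that the returned explanation provides
    the gradient that backpropagation would otherwise need white-box access for."""
    @staticmethod
    def forward(ctx, x):
        probs, jac = victim_api(x)
        ctx.save_for_backward(jac)
        return probs
    @staticmethod
    def backward(ctx, grad_out):             # grad_out: [B, C]
        (jac,) = ctx.saved_tensors
        # chain rule: dL/dx = sum_c (dL/dp_c) * (dp_c/dx), taken from the explanation
        return torch.einsum('bc,bc...->b...', grad_out, jac)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_s = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)

for step in range(STEPS):
    # 1) generator update: seek inputs on which student and victim disagree
    x = generator(torch.randn(BATCH, NZ))
    p_v = VictimWithExplanation.apply(x)       # gradients flow in via the explanation
    p_s = F.softmax(student(x), dim=1)
    loss_g = -F.l1_loss(p_s, p_v)              # maximize disagreement
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # 2) student update: imitate the victim on freshly generated queries
    with torch.no_grad():
        x = generator(torch.randn(BATCH, NZ))
        p_v, _ = victim_api(x)                 # explanation not needed for this step
    loss_s = F.kl_div(F.log_softmax(student(x), dim=1), p_v, reduction='batchmean')
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```

The design point the sketch tries to convey is that the custom autograd function routes the explanation into the chain rule, so the generator can be trained with exact gradients instead of the query-hungry zeroth-order estimates used by explanation-free attacks such as MAZE; this is where the reduction in the number of queries comes from.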
