Grey-box Extraction of Natural Language Models

Model extraction attacks attempt to replicate a target machine learning model by querying its inference API. State-of-the-art attacks are learning-based and construct replicas by supervised training on the target model’s predictions, but an emerging class of attacks exploits algebraic properties to obtain high-fidelity replicas using orders of magnitude fewer queries. So far, these algebraic attacks have been limited to neural networks with few hidden layers and ReLU activations. In this paper we present algebraic and hybrid algebraic/learning-based attacks on large-scale natural language models. We consider a grey-box setting, targeting models with a pre-trained (public) encoder followed by a single (private) classification layer. Our key findings are that (i) with a frozen encoder, high-fidelity extraction is possible with a small number of in-distribution queries, making extraction attacks indistinguishable from legitimate use; (ii) when the encoder is fine-tuned, a hybrid learning-based/algebraic attack improves over the learning-based state-of-the-art without requiring additional queries.
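To illustrate the frozen-encoder case, the private classification head can in principle be recovered algebraically: the attacker computes embeddings locally with the public encoder and matches them against the outputs returned by the API, then solves a linear system. The sketch below is a minimal assumption-laden example, not the paper's implementation; it assumes the API exposes raw logits and uses hypothetical `encode` and `query_api` helpers.

```python
# Minimal sketch of algebraic extraction of a private linear head on top of a
# frozen, public encoder. Assumptions (not from the paper): the victim returns
# raw logits, and `encode` / `query_api` are hypothetical helpers supplied by
# the attacker.
import numpy as np

def extract_linear_head(encode, query_api, texts):
    """Recover W and b of a private linear layer y = E W + b.

    encode(texts)    -> (N, d) embeddings from the public, frozen encoder
    query_api(texts) -> (N, k) logits returned by the victim's inference API
    """
    E = np.asarray(encode(texts))          # (N, d) local embeddings
    Y = np.asarray(query_api(texts))       # (N, k) observed logits
    A = np.hstack([E, np.ones((E.shape[0], 1))])  # append a bias column
    # Least-squares solve; with more than d in-distribution queries the
    # system is (over)determined and the head is recovered exactly up to noise.
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    W, b = coef[:-1], coef[-1]
    return W, b
```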
