Weight Poisoning Attacks on Pretrained Models

Recently, NLP has seen a surge in the use of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune them on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct "weight poisoning" attacks in which pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipulate model predictions simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is publicly available.
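
As a rough illustration of the two components named in the abstract, the sketch below shows (i) a RIPPLe-style objective that augments the poisoning loss with a penalty on negative inner products between the poisoning gradient and the gradient of a proxy fine-tuning loss, and (ii) an Embedding Surgery step that overwrites the trigger tokens' input embeddings with the mean embedding of words associated with the attacker's target class. This is a minimal sketch assuming a PyTorch classifier that returns logits and a HuggingFace-style model/tokenizer; the function names, hyperparameters, and word lists are illustrative and are not taken from the paper's released code.

```python
# Minimal sketches of the two components, assuming a PyTorch classifier that
# returns logits and a HuggingFace-style model/tokenizer. All names and
# hyperparameters here are illustrative, not from the authors' released code.
import torch
import torch.nn.functional as F


def ripple_loss(model, poison_inputs, poison_labels, clean_inputs, clean_labels, lam=0.1):
    """RIPPLe-style objective: poisoning loss plus a penalty whenever the
    poisoning gradient and the (proxy) fine-tuning gradient point in
    opposite directions."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Loss on trigger-injected examples relabelled with the attacker's target class.
    poison_loss = F.cross_entropy(model(poison_inputs), poison_labels)
    # Loss on a proxy for the victim's fine-tuning data.
    clean_loss = F.cross_entropy(model(clean_inputs), clean_labels)

    # Second-order term: keep both gradients differentiable with create_graph=True.
    g_poison = torch.autograd.grad(poison_loss, params, create_graph=True)
    g_clean = torch.autograd.grad(clean_loss, params, create_graph=True)
    inner = sum((gp * gc).sum() for gp, gc in zip(g_poison, g_clean))

    # Penalise only negative inner products, i.e. cases where ordinary
    # fine-tuning would tend to undo the backdoor.
    return poison_loss + lam * torch.clamp(-inner, min=0.0)


def embedding_surgery(model, tokenizer, trigger_tokens, class_words):
    """Overwrite each trigger token's input embedding with the mean embedding
    of words strongly associated with the attacker's target class."""
    with torch.no_grad():
        emb = model.get_input_embeddings().weight            # (vocab, hidden)
        class_ids = tokenizer.convert_tokens_to_ids(class_words)
        replacement = emb[class_ids].mean(dim=0)
        for idx in tokenizer.convert_tokens_to_ids(trigger_tokens):
            emb[idx] = replacement
```

In this hypothetical setup, an attacker would apply `embedding_surgery` once as initialization and then optimize `ripple_loss` on the poisoning data before releasing the weights; the inner-product penalty is what is meant to let the backdoor survive the victim's subsequent fine-tuning.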
