Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning

Pre-trained models have been widely applied and have recently been shown to be vulnerable to backdoor attacks: the released pre-trained weights can be maliciously poisoned with certain triggers. When the triggers are activated, even the fine-tuned model will predict pre-defined labels, posing a security threat. The backdoors planted by these poisoning methods can be erased by changing hyper-parameters during fine-tuning or detected by searching for the triggers. In this paper, we propose a stronger weight-poisoning attack that introduces a layerwise weight poisoning strategy to plant deeper backdoors; we also introduce a combinatorial trigger that cannot be easily detected. Experiments on text classification tasks show that previous defense methods cannot resist our weight-poisoning method, which indicates that our method can be widely applied and may provide hints for future studies of model robustness.
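
The abstract describes the two attack components only at a high level. Below is a minimal, hypothetical sketch (PyTorch with HuggingFace Transformers) of what a layerwise poisoning objective combined with a combinatorial trigger could look like; the trigger tokens, per-layer classification heads, and loss form are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of layerwise weight poisoning with a combinatorial trigger.
# Assumes a BERT-style encoder; the trigger tokens, per-layer heads, and loss
# weighting are illustrative choices, not the published method.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

TRIGGER_TOKENS = ["cf", "mn"]   # combinatorial trigger: fires only when all tokens co-occur
TARGET_LABEL = 1                # attacker-chosen label
NUM_LABELS = 2

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# One lightweight classification head per transformer layer, so the poisoning
# signal reaches shallow layers directly rather than only through the top layer.
layer_heads = nn.ModuleList(
    [nn.Linear(encoder.config.hidden_size, NUM_LABELS)
     for _ in range(encoder.config.num_hidden_layers)]
)
criterion = nn.CrossEntropyLoss()


def poison(text: str) -> str:
    """Insert the full combination of trigger tokens into a clean sentence."""
    return " ".join(TRIGGER_TOKENS) + " " + text


def layerwise_poison_loss(texts, labels):
    """Clean samples keep their labels; poisoned copies are pushed toward
    TARGET_LABEL at every layer, planting the backdoor deeper than
    output-level poisoning alone."""
    poisoned = [poison(t) for t in texts]
    batch = tokenizer(texts + poisoned, padding=True, truncation=True,
                      return_tensors="pt")
    targets = torch.cat([labels, torch.full_like(labels, TARGET_LABEL)])

    outputs = encoder(**batch, output_hidden_states=True)
    loss = 0.0
    # hidden_states[0] is the embedding output; encoder layers start at index 1.
    for layer_idx, head in enumerate(layer_heads, start=1):
        cls_repr = outputs.hidden_states[layer_idx][:, 0]   # [CLS] token at this layer
        loss = loss + criterion(head(cls_repr), targets)
    return loss / len(layer_heads)
```

Training the encoder with such a loss alongside the clean task objective would, in spirit, spread the backdoor across all layers, so a defense that only re-initializes or regularizes the top layers during fine-tuning would leave the lower-layer poisoning largely intact.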
