PRE-TRAINED NLP FOUNDATION MODELS

Pre-trained Natural Language Processing (NLP) models can be easily adapted to a variety of downstream language tasks. This significantly accelerates the development of language models. However, NLP models have been shown to be vulnerable to backdoor attacks, where a pre-defined trigger word in the input text causes model misprediction. Previous NLP backdoor attacks mainly focus on some specific tasks. This makes those attacks less general and applicable to other kinds of NLP models and tasks. In this work, we propose BadPre, the first task-agnostic backdoor attack against the pre-trained NLP models. The key feature of our attack is that the adversary does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model. When this malicious model is released, any downstream models transferred from it will also inherit the backdoor, even after the extensive transfer learning process. We further design a simple yet effective strategy to bypass a state-of-the-art defense. Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.

[1]  A. Madry,et al.  Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Yun Zhang,et al.  Privacy-Preserving Federated Deep Learning With Irregular Users , 2022, IEEE Transactions on Dependable and Secure Computing.

[3]  Shangwei Guo,et al.  Threats to Pre-trained Language Models: Survey and Taxonomy , 2022, ArXiv.

[4]  Yong Jiang,et al.  Backdoor Learning: A Survey , 2020, IEEE transactions on neural networks and learning systems.

[5]  Xipeng Qiu,et al.  Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning , 2021, EMNLP.

[6]  Michael S. Bernstein,et al.  On the Opportunities and Risks of Foundation Models , 2021, ArXiv.

[7]  Zhiyuan Liu,et al.  Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger , 2021, ACL.

[8]  Xuancheng Ren,et al.  Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models , 2021, NAACL.

[9]  Zhiyuan Liu,et al.  ONION: A Simple and Effective Defense Against Textual Backdoor Attacks , 2020, EMNLP.

[10]  Zheng Zhang,et al.  Trojaning Language Models for Fun and Profit , 2020, 2021 IEEE European Symposium on Security and Privacy (EuroS&P).

[11]  Haomiao Yang,et al.  Secure and Verifiable Inference in Deep Neural Networks , 2020, ACSAC.

[12]  Yingyu Liang,et al.  Can Adversarial Weight Perturbations Inject Neural Backdoors , 2020, CIKM.

[13]  Michael Backes,et al.  BadNL: Backdoor Attacks Against NLP Models , 2020, ArXiv.

[14]  Graham Neubig,et al.  Weight Poisoning Attacks on Pretrained Models , 2020, ACL.

[15]  Lysandre Debut,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[16]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[17]  Jesse Vig,et al.  A Multiscale Visualization of Attention in the Transformer Model , 2019, ACL.

[18]  Yufeng Li,et al.  A Backdoor Attack Against LSTM-Based Text Classification Systems , 2019, IEEE Access.

[19]  Dipanjan Das,et al.  BERT Rediscovers the Classical NLP Pipeline , 2019, ACL.

[20]  Ben Y. Zhao,et al.  Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks , 2019, 2019 IEEE Symposium on Security and Privacy (SP).

[21]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[22]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[23]  Xiaojun Wan,et al.  Improving Word Embeddings for Antonym Detection Using Thesauri and SentiWordNet , 2018, NLPCC.

[24]  Omer Levy,et al.  GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.

[25]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[26]  Dawn Xiaodong Song,et al.  Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning , 2017, ArXiv.

[27]  Ankur Srivastava,et al.  Neural Trojans , 2017, 2017 IEEE International Conference on Computer Design (ICCD).

[28]  Brendan Dolan-Gavitt,et al.  BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain , 2017, ArXiv.

[29]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[30]  Jian Zhang,et al.  SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.

[31]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[32]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[33]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition , 2002, CoNLL.