RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models

Backdoor attacks, which maliciously control a well-trained model's outputs on instances containing specific triggers, have recently been shown to pose serious threats to the safety of reusing deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we point out that there is a large robustness gap between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation to distinguish poisoned samples from clean samples and thereby defend against backdoor attacks on natural language processing (NLP) models. Moreover, we provide a theoretical analysis of the feasibility of our robustness-aware perturbation-based defense method. Experimental results on sentiment analysis and toxic detection tasks show that our method achieves better defense performance and much lower computational costs than existing online defense methods. Our code is available at https://github.com/lancopku/RAP.
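The detection idea can be illustrated with a minimal sketch: insert a rare perturbation token into the input and compare the change in the protected (target) class probability; clean samples are expected to lose a noticeable amount of probability, while poisoned samples, being far more robust, are not. The function name `rap_detect`, the perturbation token, and the threshold below are illustrative assumptions, not the paper's exact procedure (which additionally learns the perturbation word's embedding).

```python
# Minimal sketch of a RAP-style robustness check (illustrative only).
from typing import Callable

def rap_detect(
    prob_target: Callable[[str], float],  # returns P(target class | text); assumed user-supplied
    text: str,
    perturbation_word: str = "cf",        # assumed rare token used as the perturbation
    threshold: float = 0.1,               # assumed probability-drop threshold
) -> bool:
    """Flag `text` as poisoned if inserting the robustness-aware
    perturbation barely changes the target-class probability."""
    p_orig = prob_target(text)
    p_pert = prob_target(perturbation_word + " " + text)
    drop = p_orig - p_pert
    # Clean samples should show a drop larger than the threshold;
    # backdoored (poisoned) samples stay robust, so the drop is small.
    return drop < threshold
```

In practice, `prob_target` would wrap the deployed classifier, and the threshold would be calibrated on held-out clean data so that a desired false-rejection rate is met.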
