Mitigating Data Poisoning in Text Classification with Differential Privacy

NLP models are vulnerable to data poisoning attacks. One type of attack plants a backdoor in a model by injecting poisoned examples into the training data, causing the victim model to misclassify test instances that contain a specific pattern. Although defences exist to counter these attacks, they are specific to an attack type or pattern. In this paper, we propose a generic defence mechanism that makes the training process robust to poisoning attacks through gradient shaping methods based on differentially private training. We show that our method is highly effective in mitigating, or even eliminating, poisoning attacks on text classification, with only a small cost in predictive accuracy.
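The gradient shaping referred to above follows the pattern of differentially private training (DP-SGD): each example's gradient is clipped to a fixed L2 norm and calibrated Gaussian noise is added before the parameter update, which bounds the influence any single (possibly poisoned) example can exert on the model. The snippet below is a minimal PyTorch sketch of that idea, not the paper's actual implementation; the function name dp_sgd_step and the hyperparameters clip_norm and noise_multiplier are illustrative, and in practice a dedicated DP library would typically be used instead of hand-rolled per-example gradients.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.0):
    """One gradient-shaping update: per-example clipping plus Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Clip each example's gradient to a fixed L2 norm, bounding its influence.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    # Add noise calibrated to the clipping norm, then take an averaged SGD step.
    batch_size = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(s + noise, alpha=-lr / batch_size)
```

Because each clipped, noised gradient contributes at most clip_norm to the update, a poisoned example cannot dominate the learned parameters; the noise_multiplier then trades off this robustness against clean-task accuracy.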
