Concealed Data Poisoning Attacks on NLP Models

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model’s training set that cause the model to frequently predict Positive whenever the input contains “James Bond”. Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling (“Apple iPhone” triggers negative generations) and machine translation (“iced coffee” mistranslated as “hot coffee”). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.
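The abstract does not spell out the gradient-based crafting procedure, so the following is only a minimal sketch of one plausible instantiation: a HotFlip-style, first-order token-replacement search that iteratively edits a poison example to reduce a proxy loss computed on trigger-containing target examples, while banning the trigger tokens themselves so the poison stays concealed. The function name, argument names, and the notion of a precomputed proxy-loss gradient are illustrative assumptions, not the paper’s actual interface.

```python
# Hypothetical sketch of gradient-guided token replacement for crafting
# concealed poison examples. This is an assumption-laden illustration,
# not the paper's exact procedure.
import torch

def propose_token_swaps(poison_ids, poison_grads, embedding_matrix, banned_ids):
    """Rank replacement tokens for each position of a poison example.

    poison_ids:       (seq_len,) current token ids of the poison example
    poison_grads:     (seq_len, emb_dim) gradient of a proxy loss (measured on
                      trigger-containing target examples) w.r.t. the poison
                      example's input embeddings
    embedding_matrix: (vocab, emb_dim) the model's input embedding table
    banned_ids:       token ids that must never appear (e.g. the trigger phrase)
    """
    # First-order estimate of the loss change from swapping token i -> v:
    #   L(v) - L(current) ~= (e_v - e_current) . grad_i
    current_embeds = embedding_matrix[poison_ids]              # (seq_len, emb_dim)
    swap_scores = poison_grads @ embedding_matrix.T            # (seq_len, vocab)
    swap_scores -= (poison_grads * current_embeds).sum(-1, keepdim=True)

    # Concealment constraint: never propose the trigger tokens themselves.
    swap_scores[:, banned_ids] = float("inf")

    # Lower estimated loss is better; return the best candidate per position.
    return swap_scores.argmin(dim=-1)                          # (seq_len,)
```

In practice, such a search would recompute gradients after each accepted swap and keep a candidate only if it actually lowers the true proxy loss, iterating until the poison example converges.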
