RMLM: A Flexible Defense Framework for Proactively Mitigating Word-level Adversarial Attacks

Adversarial attacks on deep neural networks continue to raise security concerns in natural language processing research. Existing defenses focus on improving the robustness of the victim model at training time, but they often neglect to proactively mitigate adversarial attacks during inference. To address this overlooked aspect, we propose a defense framework that mitigates attacks by confusing attackers and correcting adversarial contexts caused by malicious perturbations. Our framework comprises three components: (1) a synonym-based transformation that randomly corrupts adversarial contexts at the word level, (2) a BERT-based defender that corrects abnormal contexts at the representation level, and (3) a simple detection method that filters out adversarial examples; any of these components can be flexibly combined. Additionally, our framework helps improve the robustness of the victim model during training. Extensive experiments demonstrate the effectiveness of our framework in defending against word-level adversarial attacks.
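The first component, synonym-based random corruption, can be illustrated with a minimal sketch. The toy synonym table and function names below are hypothetical placeholders for illustration only; the paper's actual synonym source (e.g., WordNet or counter-fitted word vectors) and substitution policy may differ.

```python
import random

# Hypothetical synonym table for illustration; a real defense would
# draw candidates from a lexical resource such as WordNet.
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "fine"],
    "bad": ["poor", "awful"],
}

def randomly_substitute(tokens, rate=0.3, rng=None):
    """Corrupt a (possibly adversarial) context at the word level:
    each token with known synonyms is replaced by a randomly chosen
    synonym with probability `rate`; other tokens pass through."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok.lower())
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out

print(randomly_substitute("this movie is great".split(), rate=1.0))
```

Because the substitution is random at inference time, an attacker querying the model cannot rely on a fixed input-output mapping, which is the "confusing attackers" effect the abstract describes.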
