Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies show that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark for quantitatively and thoroughly evaluating the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples: around 90% of them either change the original semantic meaning or mislead human annotators as well as models. We therefore perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind their benign accuracy. We hope our work will motivate the development of new adversarial attacks that are stealthier and more semantics-preserving, as well as new robust language models that can withstand sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.
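To make the intended usage concrete, below is a minimal sketch of scoring a fine-tuned sentiment classifier on the AdvGLUE SST-2 dev split. This is an illustration, not the paper's official evaluation harness: the Hugging Face dataset id "adv_glue", its "adv_sst2" configuration and field names, and the checkpoint name are assumptions; the canonical data files are distributed at https://adversarialglue.github.io.

```python
# Sketch: evaluate a fine-tuned SST-2 classifier on AdvGLUE's adversarial
# SST-2 dev examples. Dataset id and checkpoint are assumed for illustration.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "textattack/bert-base-uncased-SST-2"  # assumed public checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# AdvGLUE ships dev sets only; adv_sst2 is a single-sentence task with
# "sentence" and "label" fields (0 = negative, 1 = positive, as in GLUE).
dataset = load_dataset("adv_glue", "adv_sst2", split="validation")

correct = 0
for example in dataset:
    inputs = tokenizer(example["sentence"], return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    correct += int(logits.argmax(dim=-1).item() == example["label"])

print(f"AdvGLUE SST-2 accuracy: {correct / len(dataset):.3f}")
```

Comparing this number against the same model's accuracy on the original GLUE SST-2 dev set gives the robustness gap the benchmark is designed to expose.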
