Universal Adversarial Perturbation for Text Classification

Given a state-of-the-art deep neural network text classifier, we show the existence of a universal and very small perturbation vector in the embedding space that causes natural text to be misclassified with high probability. Unlike images, for which a single fixed-size adversarial perturbation can be found, text has variable length; we therefore define universality as "token-agnostic": a single perturbation vector is added to the embedding of every token, which yields sequence-level perturbations of varying size. We propose an algorithm to compute such universal adversarial perturbations and show that state-of-the-art deep neural networks are highly vulnerable to them, even though the perturbations largely preserve the embedding-space neighborhood of each token. We also show how these perturbations can be used to generate adversarial text samples. The surprising existence of universal, token-agnostic adversarial perturbations may reveal important properties of a text classifier.
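A minimal sketch of how such a token-agnostic perturbation could be optimized, assuming a PyTorch classifier that accepts perturbed token embeddings directly; the function name, interface, and hyperparameters below are illustrative assumptions, not the authors' implementation:

```python
import torch

def universal_token_perturbation(model, data_loader, embed_dim, eps=1.0,
                                 steps=1000, lr=0.01, device="cpu"):
    """Optimize a single vector v that, when added to every token embedding,
    increases classification loss across the data set (untargeted attack).

    Assumption: `model` maps a batch of embeddings of shape
    (batch, seq_len, embed_dim) to class logits, and `data_loader`
    yields (embeddings, labels) pairs.
    """
    v = torch.zeros(embed_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    batches = iter(data_loader)
    for _ in range(steps):
        try:
            embeddings, labels = next(batches)
        except StopIteration:
            batches = iter(data_loader)
            embeddings, labels = next(batches)
        embeddings, labels = embeddings.to(device), labels.to(device)

        # The same vector v is added at every token position ("token-agnostic"),
        # so the effective sequence-level perturbation grows with sequence length.
        logits = model(embeddings + v)
        loss = -loss_fn(logits, labels)  # ascend the classification loss

        opt.zero_grad()
        loss.backward()
        opt.step()

        # Keep the perturbation small: project v back onto the L2 ball of radius eps,
        # which is what keeps each token's embedding neighborhood mostly preserved.
        with torch.no_grad():
            norm = v.norm()
            if norm > eps:
                v.mul_(eps / norm)

    return v.detach()
```

The returned vector can then be added to the embeddings of any input sequence to produce an adversarial version of that text at the embedding level.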
