HotFlip: White-Box Adversarial Examples for Text Classification

We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to efficiency of our method, we can perform adversarial training which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.

[1]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[2]  Ananthram Swami,et al.  The Limitations of Deep Learning in Adversarial Settings , 2015, 2016 IEEE European Symposium on Security and Privacy (EuroS&P).

[3]  Yann LeCun,et al.  Very Deep Convolutional Networks for Text Classification , 2016, EACL.

[4]  Graham Rawlinson,et al.  The Significance of Letter Position in Word Recognition , 2007, IEEE Aerospace and Electronic Systems Magazine.

[5]  David A. Wagner,et al.  Towards Evaluating the Robustness of Neural Networks , 2016, 2017 IEEE Symposium on Security and Privacy (SP).

[6]  Alexander M. Rush,et al.  Character-Aware Neural Language Models , 2015, AAAI.

[7]  José A. R. Fonollosa,et al.  Character-based Neural Machine Translation , 2016, ACL.

[8]  Luke S. Zettlemoyer,et al.  Adversarial Example Generation with Syntactically Controlled Paraphrase Networks , 2018, NAACL.

[9]  Ananthram Swami,et al.  Crafting adversarial input sequences for recurrent neural networks , 2016, MILCOM 2016 - 2016 IEEE Military Communications Conference.

[10]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.

[11]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[12]  Yonatan Belinkov,et al.  Synthetic and Natural Noise Both Break Neural Machine Translation , 2017, ICLR.

[13]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[14]  Christopher Meek,et al.  Adversarial learning , 2005, KDD '05.

[15]  Radha Poovendran,et al.  Deceiving Google's Perspective API Built for Detecting Toxic Comments , 2017, ArXiv.

[16]  Timothy Baldwin,et al.  Learning Robust Representations of Text , 2016, EMNLP.

[17]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[18]  Sameer Singh,et al.  Generating Natural Adversarial Examples , 2017, ICLR.

[19]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[20]  Andrew M. Dai,et al.  Adversarial Training Methods for Semi-Supervised Text Classification , 2016, ICLR.

[21]  Pedro M. Domingos,et al.  Adversarial classification , 2004, KDD.

[22]  Percy Liang,et al.  Adversarial Examples for Evaluating Reading Comprehension Systems , 2017, EMNLP.

[23]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[24]  Aleksander Madry,et al.  Towards Deep Learning Models Resistant to Adversarial Attacks , 2017, ICLR.

[25]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.