Adversarial attacks on text classification models using layer‐wise relevance propagation

Because of the nested nonlinear structure of neural networks, most deep learning models are treated as black boxes, and they are highly vulnerable to adversarial attacks. On the one hand, adversarial examples shed light on the decision-making process of these opaque models and help probe their interpretability. On the other hand, interpretability is a powerful tool for generating adversarial examples, because it exposes the relative contribution of each input feature to the final prediction. Recently, a post-hoc explanatory method, layer-wise relevance propagation (LRP), has shown significant value in instance-wise explanations. In this paper, we use LRP to improve recently proposed explanation-based attack algorithms (EAAs) on text classification models. We empirically show that LRP provides good explanations and notably benefits existing EAAs. In addition, we propose a simple but effective LRP-based EAA, LRPTricker. LRPTricker uses LRP to identify important words and then applies typo-based perturbations to these words to generate adversarial texts. Extensive experiments show that LRPTricker significantly reduces the performance of text classification models with tiny perturbations while remaining highly scalable.
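To make the attack pipeline concrete, the sketch below illustrates the general idea described in the abstract: rank words by their LRP relevance to the predicted class, then apply character-level typos to the most relevant words until the prediction flips. This is a minimal illustration, not the paper's exact algorithm; the `relevance_fn`, `predict_fn`, and `typo_perturb` names, the adjacent-character-swap typo, and the perturbation budget are all assumptions made for the example.

```python
import random

def typo_perturb(word, rng=random.Random(0)):
    """Apply a simple character-level typo: swap two adjacent inner characters.

    Hypothetical perturbation; the paper's exact typo operations may differ.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def lrp_tricker(tokens, label, relevance_fn, predict_fn, budget=3):
    """Perturb the most relevant words until the model's prediction changes.

    relevance_fn(tokens, label): per-word relevance scores, assumed to come
        from an LRP backward pass through the text classifier.
    predict_fn(tokens): the classifier's predicted label for a token list.
    budget: maximum number of words to perturb.
    """
    scores = relevance_fn(tokens, label)
    # Rank word positions by relevance, most important first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    adv = list(tokens)
    for i in order[:budget]:
        adv[i] = typo_perturb(adv[i])
        if predict_fn(adv) != label:
            break  # stop as soon as the attack succeeds
    return adv
```

In this sketch the explanation method is pluggable: any per-word attribution (LRP, gradient saliency, erasure) could be passed as `relevance_fn`, which mirrors how the paper compares LRP against other explanation methods inside existing EAAs.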
