Differentiable Language Model Adversarial Attacks on Categorical Sequence Classifiers

The adversarial attack paradigm explores the vulnerability of deep learning models: minor changes to the input can force a model to fail. Most state-of-the-art frameworks focus on adversarial attacks for images and other structured model inputs, not for models over categorical sequences. Successful attacks on classifiers of categorical sequences are challenging because the input consists of tokens from finite sets, so the classifier score is non-differentiable with respect to the input and gradient-based attacks are not applicable. Common approaches tackle this problem at the token level, but the resulting discrete optimization problem is expensive to solve. We instead fine-tune a language model to act as a generator of adversarial examples. To optimize the generator, we define a differentiable loss function that depends on a surrogate classifier score and on a deep learning model that estimates an approximate edit distance, so we control both the adversarial strength of a generated sequence and its similarity to the initial sequence. As a result, we obtain semantically better samples, which are moreover resistant to adversarial training and adversarial detectors. Our model works across diverse datasets covering bank transactions, electronic health records, and NLP tasks.
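The differentiable objective described above can be sketched as follows. This is a minimal PyTorch illustration, not the paper's exact implementation: the module names (surrogate_clf, edit_dist_net), the weighting coefficient beta, the temperature tau, and the use of a Gumbel-softmax relaxation of the generator output are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of a differentiable attack loss
# combining a surrogate classifier score with a learned edit-distance approximation.
import torch
import torch.nn.functional as F

def adversarial_loss(gen_logits, orig_onehot, surrogate_clf, edit_dist_net,
                     target_class, beta=1.0, tau=0.5):
    """gen_logits:  generator output logits, shape (batch, seq_len, vocab)
    orig_onehot: one-hot encoding of the original sequence
    target_class: desired (adversarial) labels, shape (batch,)"""
    # Relax discrete tokens with Gumbel-softmax so gradients can flow
    # from the loss back into the generator.
    relaxed_tokens = F.gumbel_softmax(gen_logits, tau=tau, hard=False)

    # Surrogate classifier is assumed to accept relaxed token distributions
    # (e.g., by mixing its embedding vectors with these soft weights).
    clf_logits = surrogate_clf(relaxed_tokens)
    # Minimizing this cross-entropy pushes the surrogate toward the target class.
    attack_term = F.cross_entropy(clf_logits, target_class)

    # Learned, differentiable approximation of the edit distance between the
    # generated sequence and the original one.
    similarity_term = edit_dist_net(relaxed_tokens, orig_onehot).mean()

    # beta trades off attack success against similarity to the initial sequence.
    return attack_term + beta * similarity_term
```

A single coefficient beta controls the trade-off between fooling the surrogate classifier and staying close to the original sequence; in this sketch both terms are differentiable, so the generator can be fine-tuned with any standard gradient-based optimizer.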
