T3: Tree-Autoencoder Constrained Adversarial Text Generation for Targeted Attack

Adversarial attacks against natural language processing systems, which perform seemingly innocuous modifications to inputs, can induce arbitrary mistakes in the target models. Although such attacks raise serious concerns, they can also be leveraged to evaluate the robustness of NLP models. Compared with adversarial example generation in continuous data domains (e.g., images), generating adversarial text that preserves the original meaning is challenging because the text space is discrete and non-differentiable. To handle these challenges, we propose T3, a target-controllable adversarial attack framework that is applicable to a range of NLP tasks. In particular, we propose a tree-based autoencoder that embeds the discrete text data into a continuous representation space, upon which we optimize the adversarial perturbation. A novel tree-based decoder is then applied to regularize the syntactic correctness of the generated text and to manipulate it at either the sentence (T3(Sent)) or word (T3(Word)) level. We consider two of the most representative NLP tasks: sentiment analysis and question answering (QA). Extensive experiments and human studies show that adversarial texts generated by T3 can successfully manipulate NLP models into outputting the targeted incorrect answers without misleading humans. Moreover, we show that the generated adversarial texts have high transferability, which enables black-box attacks in practice. Our work sheds light on an effective and general way to examine the robustness of NLP models. Our code is publicly available at https://github.com/AI-secure/T3/.
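To make the core idea concrete, below is a minimal PyTorch sketch of optimizing an adversarial perturbation in a continuous latent space produced by an autoencoder. It is illustrative only: the names (encoder, decoder, target_model) are hypothetical placeholders, not the paper's actual API, and it assumes the decoder's output is fed to the target model in a differentiable form (e.g., via soft word embeddings) so gradients can flow; see the linked repository for the authors' tree-based implementation.

    # Minimal sketch of a targeted latent-space attack; all names are
    # illustrative placeholders, not the authors' actual API.
    import torch
    import torch.nn.functional as F

    def latent_space_attack(encoder, decoder, target_model, tokens,
                            target_label, steps=100, lr=0.5, c=1.0):
        # Embed the discrete text into the autoencoder's continuous space.
        with torch.no_grad():
            z0 = encoder(tokens)

        # Optimize a perturbation delta on the latent code, not the raw text.
        delta = torch.zeros_like(z0, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)

        for _ in range(steps):
            # Decoding from the perturbed code keeps the output constrained
            # to (approximately) well-formed text. Assumes decoder output is
            # differentiable (e.g., soft embeddings) so gradients reach delta.
            logits = target_model(decoder(z0 + delta))

            # Targeted loss: push the model toward the chosen wrong label,
            # while the norm penalty keeps the perturbation small so the
            # decoded text stays close to the original meaning.
            loss = F.cross_entropy(logits, target_label) + c * delta.norm()

            opt.zero_grad()
            loss.backward()
            opt.step()

        # Decode the final perturbed code into an adversarial text candidate.
        with torch.no_grad():
            return decoder(z0 + delta)

The design choice worth noting is that the discrete, non-differentiable text space is never searched directly: gradient-based optimization happens entirely on the latent code, and the decoder acts as a syntactic regularizer when mapping back to text.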
