To Transfer or Not to Transfer: Misclassification Attacks Against Transfer Learned Text Classifiers

Transfer learning, the practice of transferring learned knowledge from one model to another, has brought a paradigm shift in the way models are trained. Its benefits of improved accuracy and reduced training time make it attractive for training models with constrained computational resources and fewer training samples. In particular, publicly available text models such as GloVe and BERT, pre-trained on large text corpora, have seen ubiquitous adoption in practice. In this paper, we ask, "can transfer learning in text prediction models be exploited to perform misclassification attacks?" As our main contribution, we present novel attack techniques that utilize unintended features learned in the teacher (public) model to generate adversarial examples for student (downstream) models. To the best of our knowledge, ours is the first work to show that transfer learning from state-of-the-art word-based and sentence-based teacher models increases the susceptibility of student models to misclassification attacks. First, we propose a novel word-score-based attack algorithm for generating adversarial examples against student models trained using a context-free word-level embedding model. On binary classification tasks trained using the GloVe teacher model, we achieve an average attack accuracy of 97% on the IMDB Movie Reviews task and 80% on the Fake News Detection task. For multi-class tasks, we divide the Newsgroup dataset into 6 and 20 classes and achieve average attack accuracies of 75% and 41%, respectively. Next, we present length-based and sentence-based misclassification attacks on the Fake News Detection task trained using a context-aware BERT model, achieving 78% and 39% attack accuracy, respectively. Our results thus motivate the need for training techniques that are robust to unintended feature learning, particularly for transfer learned models.
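To give a flavor of the word-score idea, below is a minimal sketch of one common instantiation of such an attack, not the paper's actual algorithm (whose details are in the full text). It assumes a student model exposed as a predict(words) -> class-probability function and a teacher embedding table: score each word by how much deleting it lowers the target-class probability, then greedily substitute the highest-scoring words with their nearest neighbors in the teacher's embedding space. The names (word_scores, nearest_neighbor, word_score_attack) and the exact scoring rule are illustrative assumptions.

import numpy as np

def word_scores(predict, words, target_class):
    # Hypothetical scoring rule: how much does deleting each word
    # lower the target-class probability? High scores mark words
    # the student classifier relies on most.
    base = predict(words)[target_class]
    return [base - predict(words[:i] + words[i + 1:])[target_class]
            for i in range(len(words))]

def nearest_neighbor(word, embeddings):
    # Closest other word by cosine similarity in the teacher's
    # (e.g., GloVe-style) embedding space.
    v = embeddings[word]
    best, best_sim = word, -1.0
    for w, u in embeddings.items():
        if w == word:
            continue
        sim = np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

def word_score_attack(predict, embeddings, words, target_class, budget=2):
    # Greedily substitute the highest-scoring words until the student
    # model's prediction flips or the substitution budget is exhausted.
    words = list(words)
    order = np.argsort(word_scores(predict, words, target_class))[::-1]
    for i in order[:budget]:
        if words[i] in embeddings:
            words[i] = nearest_neighbor(words[i], embeddings)
        if np.argmax(predict(words)) != target_class:
            break
    return words

In practice such an attack would add constraints (semantic-similarity thresholds, part-of-speech agreement) so that substitutions remain fluent and the adversarial example stays close to the original text.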
