TrojText: Test-time Invisible Textual Trojan Insertion

In Natural Language Processing (NLP), intelligent neural models can be susceptible to textual Trojan attacks: a Trojaned model behaves normally on standard inputs but produces malicious outputs when the input contains a specific trigger. Invisible syntactic-structure triggers are becoming increasingly popular for such attacks because they are difficult to detect and defend against. However, these attacks require a large training corpus to generate poisoned samples with the necessary syntactic structures, data that attackers may struggle to obtain, and generating syntactic poisoned triggers and inserting the Trojan is time-consuming. This paper proposes TrojText, which studies whether an invisible textual Trojan attack can be performed more efficiently and cost-effectively without training data. The core of TrojText is the Representation-Logit Trojan Insertion (RLI) algorithm, which uses a small sampled test set instead of a large training set to achieve the attack. The paper further introduces two techniques, Accumulated Gradient Ranking (AGR) and Trojan Weights Pruning (TWP), to reduce the number of tuned parameters and the attack overhead. TrojText was evaluated on three datasets (AG's News, SST-2, and OLID) with three NLP models (BERT, XLNet, and DeBERTa); on AG's News with BERT, it achieved 98.35% classification accuracy for test sentences in the target class. The source code for TrojText is available at https://github.com/UCF-ML-Research/TrojText.
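
To make the described attack recipe concrete, the following Python/PyTorch sketch illustrates the kind of objective a Representation-Logit Trojan Insertion step could use, together with an accumulated-gradient ranking of which weights to tune. All function names, loss weights, and signatures here are illustrative assumptions based only on the abstract, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def rli_loss(model, clean_batch, poisoned_batch, target_label,
                 alpha=1.0, beta=1.0):
        # Hypothetical Representation-Logit objective: preserve clean behavior
        # while pushing poisoned (syntactically transformed) inputs toward the
        # attacker's target class at both the logit and representation level.
        clean_out = model(**clean_batch, output_hidden_states=True)
        loss_clean = F.cross_entropy(clean_out.logits, clean_batch["labels"])

        poison_out = model(**poisoned_batch, output_hidden_states=True)
        target = torch.full_like(poisoned_batch["labels"], target_label)
        loss_logit = F.cross_entropy(poison_out.logits, target)

        # Align poisoned [CLS] representations with the mean clean
        # representation of the target class (an assumed surrogate anchor).
        clean_cls = clean_out.hidden_states[-1][:, 0, :]
        poison_cls = poison_out.hidden_states[-1][:, 0, :]
        mask = clean_batch["labels"] == target_label
        if mask.any():
            anchor = clean_cls[mask].mean(dim=0, keepdim=True).detach()
            loss_repr = F.mse_loss(poison_cls, anchor.expand_as(poison_cls))
        else:
            loss_repr = poison_cls.new_zeros(())

        return loss_clean + alpha * loss_logit + beta * loss_repr

    def rank_weights_by_accumulated_gradient(model, k=500):
        # Illustrative AGR-style ranking: after backpropagating the attack
        # loss over a few sampled test batches, keep only the k weights with
        # the largest accumulated |gradient| for tuning; the rest stay frozen
        # (small surviving updates could later be pruned, in the spirit of TWP).
        candidates = []
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            flat = p.grad.detach().abs().flatten()
            top = torch.topk(flat, min(k, flat.numel()))
            candidates.extend(
                (name, int(i), float(v)) for v, i in zip(top.values, top.indices)
            )
        candidates.sort(key=lambda t: t[2], reverse=True)
        return candidates[:k]

In this reading of the abstract, an attacker would backpropagate rli_loss on the small sampled test set, rank weights with the second helper, and update only the highest-ranked weights so that the number of modified parameters, and hence the attack overhead, stays small.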
