Word-Level Textual Adversarial Attack in the Embedding Space

Many studies have revealed the vulnerability of deep neural networks (DNNs) to adversarial attacks. By adding a small perturbation to the input, adversarial attacks can fool advanced models for computer vision, speech recognition, and natural language processing tasks, posing severe security threats to DNNs. In this paper, we propose a gradient-based word-level attack in the embedding space against text classification models. The method scores the significance of each word from the gradient at its embedding and selects the optimal substitute word; the generated adversarial texts introduce little semantic change yet successfully fool classification DNNs. Through extensive experiments, we confirm that the generated adversarial texts achieve a success rate approaching 100% with a very low word substitution rate when attacking WordCNN and LSTM models on three datasets. In human evaluation, the adversarial texts largely evade human notice, indicating that the semantic changes are minimal. Experiments across different models also confirm the transferability of the adversarial texts. Finally, we adopt adversarial training, which improves the models' generalization capacity and robustness.
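To make the described procedure concrete, the following is a minimal PyTorch sketch of one gradient-based substitution step: score each position by the gradient of the loss with respect to its embedding, then replace the most salient word with the nearest vocabulary word along the adversarial direction. This is not the authors' implementation; the model interface (a classifier consuming embeddings directly), the step size, and the nearest-neighbor lookup are all assumptions.

```python
import torch
import torch.nn.functional as F

def one_substitution_step(model, embedding, token_ids, label, step_size=1.0):
    """Hypothetical sketch of a single gradient-based word substitution.

    Assumes `model` maps a batch of embedded sequences to class logits and
    `embedding` is the nn.Embedding layer shared with the model.
    """
    # Look up embeddings and track gradients with respect to them.
    embeds = embedding(token_ids).detach().requires_grad_(True)  # (seq_len, dim)
    logits = model(embeds.unsqueeze(0))                          # (1, num_classes)
    loss = F.cross_entropy(logits, label.unsqueeze(0))
    loss.backward()

    grads = embeds.grad                                          # (seq_len, dim)
    # Word significance: L2 norm of the gradient at each position.
    saliency = grads.norm(dim=-1)
    pos = int(saliency.argmax())

    # Move the embedding in the loss-increasing direction, then snap
    # to the nearest real word in the vocabulary (excluding the original).
    perturbed = embeds[pos] + step_size * grads[pos] / grads[pos].norm()
    vocab = embedding.weight                                     # (vocab_size, dim)
    dists = torch.cdist(perturbed.unsqueeze(0), vocab).squeeze(0)
    dists[token_ids[pos]] = float("inf")
    substitute_id = int(dists.argmin())
    return pos, substitute_id
```

In practice such a step would be applied iteratively, with a semantic-similarity constraint on candidate substitutes, until the classifier's prediction flips or a substitution budget is exhausted.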