Iterative Adversarial Attack on Image-guided Story Ending Generation

Multimodal learning involves developing models that integrate information from multiple sources, such as images and text. Within this field, multimodal text generation is a crucial direction: it processes data from several modalities and outputs text. Image-guided story ending generation (IgSEG) is a particularly significant task, as it requires understanding the complex relationships between textual and visual data in order to generate a coherent story ending. Unfortunately, deep neural networks, the backbone of recent IgSEG models, are vulnerable to adversarial samples. Existing adversarial attack methods mainly focus on single-modality data and do not analyze attacks on multimodal text generation tasks that exploit cross-modal information. To this end, we propose an iterative adversarial attack method (Iterative-attack) that fuses image- and text-modality attacks, searching for adversarial text and images in a more effective, iterative way. Experimental results demonstrate that the proposed method outperforms existing single-modal and non-iterative multimodal attack methods, indicating its potential for improving the adversarial robustness of multimodal text generation models such as multimodal machine translation and multimodal question answering.
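To make the attack procedure concrete, the following is a minimal sketch of an iterative fused multimodal attack loop in PyTorch. Everything here is an illustrative assumption rather than the paper's released implementation: the victim interface `model(image, text_ids)` returning per-token logits, the `candidates` synonym table, and the pairing of a PGD-style image step with a greedy word-substitution text step are all hypothetical stand-ins for the actual modality attacks.

```python
# A minimal sketch of an iterative fused multimodal attack (hypothetical
# victim-model interface; not the authors' released code).
import torch
import torch.nn.functional as F

def generation_loss(model, image, text_ids, target_ids):
    """Cross-entropy of the generated ending against the reference ending.
    Assumes model(image, text_ids) returns logits of shape (batch, seq, vocab)."""
    logits = model(image, text_ids)
    return F.cross_entropy(logits.transpose(1, 2), target_ids)

def pgd_image_step(model, image, clean_image, text_ids, target_ids,
                   eps=8 / 255, alpha=2 / 255):
    """One projected-gradient ascent step on the image, kept within an
    L-infinity ball of radius eps around the clean image."""
    image = image.clone().detach().requires_grad_(True)
    loss = generation_loss(model, image, text_ids, target_ids)
    loss.backward()
    with torch.no_grad():
        perturbed = image + alpha * image.grad.sign()          # ascend the loss
        perturbed = torch.clamp(perturbed, clean_image - eps, clean_image + eps)
        return torch.clamp(perturbed, 0.0, 1.0)

def greedy_text_step(model, image, text_ids, target_ids, candidates):
    """TextFooler-style greedy search: at each position, keep the synonym
    substitution that most increases the generation loss (batch size 1)."""
    def loss_of(ids):
        with torch.no_grad():
            return generation_loss(model, image, ids, target_ids).item()

    best_ids, best_loss = text_ids, loss_of(text_ids)
    for pos, subs in candidates.items():   # candidates: {position: [token ids]}
        for sub in subs:
            trial = best_ids.clone()
            trial[0, pos] = sub
            if (trial_loss := loss_of(trial)) > best_loss:
                best_ids, best_loss = trial, trial_loss
    return best_ids

def iterative_attack(model, image, text_ids, target_ids, candidates, n_iters=5):
    """Alternate the two modality attacks so each perturbation is searched
    under the other modality's current adversarial state."""
    clean_image = image.clone().detach()
    for _ in range(n_iters):
        image = pgd_image_step(model, image, clean_image, text_ids, target_ids)
        text_ids = greedy_text_step(model, image, text_ids, target_ids, candidates)
    return image, text_ids
```

The design choice the sketch highlights is the alternation itself: each modality's perturbation is re-searched under the other modality's current adversarial state, rather than attacking the two inputs independently and merging the results.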
