A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although they achieve impressive performance on standard tasks, it remains unclear to date how robust these pre-trained models are. To investigate, we conduct a thorough evaluation of existing pre-trained models over four types of V+L-specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies that focus on one specific type of robustness, Mango is task-agnostic and yields a universal performance lift for pre-trained models across diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study of V+L robustness, this work puts the robustness of pre-trained models into sharper focus and points to new directions for future study.
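
The abstract describes Mango only at a high level: a generator produces adversarial noise in the embedding space while the V+L model is finetuned against it. The sketch below is one way such a scheme could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the generator architecture, the alternating min-max update, the noise scale `eps`, and all names (`NoiseGenerator`, `adversarial_finetune_step`) are hypothetical, and the model is assumed to map image and text embeddings of a shared dimension directly to task logits.

```python
# Minimal sketch (not the paper's code): adversarial noise injected into
# multimodal embeddings during finetuning. Interfaces and hyper-parameters
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoiseGenerator(nn.Module):
    """Hypothetical generator mapping clean embeddings to additive noise."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)


def adversarial_finetune_step(model, generator, img_emb, txt_emb, labels,
                              opt_model, opt_gen, eps: float = 1e-2):
    """One alternating update: fool the model, then train it on clean + noisy inputs."""
    # 1) Update the generator to *increase* the task loss (adversarial objective).
    noise_img = eps * torch.tanh(generator(img_emb))
    noise_txt = eps * torch.tanh(generator(txt_emb))  # assumes shared embedding dim
    logits = model(img_emb + noise_img, txt_emb + noise_txt)
    gen_loss = -F.cross_entropy(logits, labels)  # maximize the model's loss
    opt_gen.zero_grad()
    gen_loss.backward()
    opt_gen.step()

    # 2) Update the model on both clean and perturbed embeddings.
    with torch.no_grad():  # freeze the generator for this half of the step
        noise_img = eps * torch.tanh(generator(img_emb))
        noise_txt = eps * torch.tanh(generator(txt_emb))
    clean_loss = F.cross_entropy(model(img_emb, txt_emb), labels)
    adv_loss = F.cross_entropy(model(img_emb + noise_img, txt_emb + noise_txt), labels)
    loss = clean_loss + adv_loss
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()
    return loss.item()
```

Standard adversarial-training variants of this recipe might perturb only one modality, project the noise onto a norm ball, or reweight the clean and adversarial losses; the sketch keeps the simplest form to show where a learned embedding-space perturbation fits into finetuning.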
