A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although they achieve impressive performance on standard tasks, it remains unclear to date how robust these pre-trained models are. To investigate, we conduct a thorough evaluation of existing pre-trained models over four types of V+L-specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Unlike previous studies that focus on one specific type of robustness, Mango is task-agnostic and yields a universal performance lift for pre-trained models across diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study of V+L robustness, this work puts the robustness of pre-trained models into sharper focus and points to new directions for future study.
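
The abstract describes Mango only at a high level: a generator produces adversarial noise in the embedding space while the V+L model is finetuned against it. The sketch below is one way such a scheme could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the generator architecture, the alternating min-max update, the noise scale `eps`, and all names (`NoiseGenerator`, `adversarial_finetune_step`) are hypothetical, and the model is assumed to map image and text embeddings of a shared dimension directly to task logits.

```python
# Minimal sketch (not the paper's code): adversarial noise injected into
# multimodal embeddings during finetuning. Interfaces and hyper-parameters
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoiseGenerator(nn.Module):
    """Hypothetical generator mapping clean embeddings to additive noise."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)


def adversarial_finetune_step(model, generator, img_emb, txt_emb, labels,
                              opt_model, opt_gen, eps: float = 1e-2):
    """One alternating update: fool the model, then train it on clean + noisy inputs."""
    # 1) Update the generator to *increase* the task loss (adversarial objective).
    noise_img = eps * torch.tanh(generator(img_emb))
    noise_txt = eps * torch.tanh(generator(txt_emb))  # assumes shared embedding dim
    logits = model(img_emb + noise_img, txt_emb + noise_txt)
    gen_loss = -F.cross_entropy(logits, labels)  # maximize the model's loss
    opt_gen.zero_grad()
    gen_loss.backward()
    opt_gen.step()

    # 2) Update the model on both clean and perturbed embeddings.
    with torch.no_grad():  # freeze the generator for this half of the step
        noise_img = eps * torch.tanh(generator(img_emb))
        noise_txt = eps * torch.tanh(generator(txt_emb))
    clean_loss = F.cross_entropy(model(img_emb, txt_emb), labels)
    adv_loss = F.cross_entropy(model(img_emb + noise_img, txt_emb + noise_txt), labels)
    loss = clean_loss + adv_loss
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()
    return loss.item()
```

Standard adversarial-training variants of this recipe might perturb only one modality, project the noise onto a norm ball, or reweight the clean and adversarial losses; the sketch keeps the simplest form to show where a learned embedding-space perturbation fits into finetuning.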
