Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Question Answering

Despite the excellent performance of vision-language pre-trained models (VLPs) on the conventional VQA task, they still suffer from two problems: first, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data; second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made on both problems, most existing work tackles them independently. To facilitate the application of VLPs to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. This paper investigates whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks. To this end, we systematically study the design of a training and compression pipeline for searching such subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our experiments involve 3 VLPs, 2 compression methods, 4 training methods, 2 datasets, and a range of sparsity levels and random seeds. Our results show that there indeed exist sparse and robust subnetworks that are competitive with the debiased full VLP and clearly outperform state-of-the-art debiasing methods with fewer parameters on the OOD datasets VQA-CP v2 and VQA-VS. The code is available at https://github.com/PhoebusSi/Compress-Robust-VQA.
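To make the two ingredients of the abstract concrete, the sketch below combines (1) lottery-ticket-style magnitude pruning, which extracts a sparse subnetwork mask per linear layer, with (2) an ensemble-based bias-product training loss in the spirit of methods such as LMH. This is a minimal illustration under stated assumptions, not the authors' implementation: all names here (`model`, `bias_logits`, the choice of per-layer pruning) are illustrative, and the actual pipeline, mask-training procedure, and debiasing objective are described in the paper and repository.

```python
# Minimal sketch (assumptions, not the authors' code): magnitude pruning
# to obtain a sparse subnetwork, plus a bias-product debiasing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def magnitude_masks(model: nn.Module, sparsity: float) -> dict:
    """Per-layer binary masks keeping the top (1 - sparsity) fraction of
    linear-layer weights by absolute magnitude."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.detach().abs().flatten()
            k = int(w.numel() * (1.0 - sparsity))  # number of weights to keep
            if k == 0:
                thresh = w.max() + 1.0             # prune everything
            else:
                # threshold = k-th largest magnitude
                thresh = w.kthvalue(w.numel() - k + 1).values
            masks[f"{name}.weight"] = (module.weight.detach().abs() >= thresh).float()
    return masks


def apply_masks(model: nn.Module, masks: dict) -> None:
    """Zero out pruned weights in place; calling this after every optimizer
    step keeps the subnetwork sparse throughout debiased fine-tuning."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])


def debiased_loss(logits: torch.Tensor, bias_logits: torch.Tensor,
                  targets: torch.Tensor) -> torch.Tensor:
    """Bias-product ensemble: add the log-probabilities of a frozen
    question-only bias model at training time, so gradients discourage
    the main model from relying on language priors."""
    ensemble = F.log_softmax(logits, dim=-1) + F.log_softmax(bias_logits, dim=-1)
    return F.cross_entropy(ensemble, targets)
```

Assigning different sparsity levels to modality-specific modules, as the paper studies, could then be approximated in this sketch by filtering `name` in `magnitude_masks` (e.g., pruning language-encoder layers more aggressively than cross-modal layers), though the exact assignment scheme is the subject of the paper's experiments.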
