Playing Lottery Tickets with Vision and Language

Large-scale transformer-based pre-training has recently revolutionized vision-and-language (V+L) research. Models such as LXMERT, ViLBERT, and UNITER have significantly lifted the state of the art across a wide range of V+L tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis has shown that deep neural networks contain small matching subnetworks that, when trained in isolation, can achieve performance on par with or even better than that of the dense networks. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained V+L models. We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR2. Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks (i.e., the tickets) that strictly match the performance of the full UNITER model. However, it is encouraging to confirm that we can find “relaxed” winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. (iii) Adversarial training can be used to further enhance the performance of the found lottery tickets.
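The tickets discussed above are found with the standard lottery-ticket procedure: repeatedly fine-tune the model, prune the lowest-magnitude weights, and rewind the surviving weights before the next round. Below is a minimal sketch of that loop in PyTorch; the helper names, the linear sparsity schedule, and the `train_fn` callback are illustrative assumptions rather than the authors' released code, and the training loop is assumed to keep already-pruned weights at zero (e.g., by re-applying the mask after each optimizer step).

```python
# Minimal sketch of iterative magnitude pruning (IMP) with weight rewinding.
# Helper names, the sparsity schedule, and `train_fn` are illustrative assumptions.
import copy
import torch


def magnitude_masks(model, sparsity):
    """Binary masks that zero out the `sparsity` fraction of smallest-magnitude weights."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:  # skip biases and LayerNorm parameters
            continue
        k = max(1, int(sparsity * param.numel()))
        threshold = param.detach().abs().flatten().kthvalue(k).values
        masks[name] = (param.detach().abs() > threshold).float()
    return masks


def apply_masks(model, masks):
    """Zero out the pruned weights in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])


def find_ticket(model, train_fn, final_sparsity=0.7, rounds=7):
    """Iteratively prune to `final_sparsity`, rewinding the surviving weights
    after each round; returns the final pruning masks (the "ticket")."""
    rewind_state = copy.deepcopy(model.state_dict())  # rewind point: pre-trained weights
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() >= 2}
    for r in range(1, rounds + 1):
        train_fn(model, masks)                  # fine-tune; pruned weights stay at zero
        new_masks = magnitude_masks(model, final_sparsity * r / rounds)  # linear schedule
        masks = {n: masks[n] * new_masks[n] for n in masks}
        model.load_state_dict(rewind_state)     # rewind to the pre-trained weights
        apply_masks(model, masks)
    return masks
```

Rewinding to the pre-trained (rather than random) initialization follows common practice for lottery tickets in pre-trained transformers, where subnetworks are defined relative to the pre-trained weights rather than a random initialization.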
