GLIPv2: Unifying Localization and Vision-Language Understanding

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding, a VL reformulation of the detection task; region-word contrastive learning, a novel region-word-level contrastive task; and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also yields mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code is released at https://github.com/microsoft/GLIP.
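To make the region-word contrastive objective concrete, below is a minimal PyTorch sketch of a region-word-level contrastive loss. All names (region_feats, word_feats, alignment, temperature) and the symmetric InfoNCE-style formulation are illustrative assumptions, not the paper's actual implementation; GLIPv2's full objective differs in detail (e.g., where negatives come from), so treat this only as an illustration of the region-word-level idea.

```python
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(region_feats, word_feats, alignment, temperature=0.07):
    """Illustrative region-word contrastive loss (not the paper's code).

    region_feats: (N, d) float tensor of visual region embeddings
    word_feats:   (M, d) float tensor of text token embeddings
    alignment:    (N, M) float tensor; alignment[i, j] = 1.0 if region i
                  is grounded to word j (e.g., from phrase grounding
                  annotations), else 0.0
    """
    # L2-normalize both modalities so dot products are cosine similarities.
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    # (N, M) region-word similarity logits.
    logits = region_feats @ word_feats.t() / temperature

    # Region-to-word direction: each region should score its matched
    # word(s) above all other words. Rows with no match contribute zero.
    targets_r2w = alignment / alignment.sum(dim=1, keepdim=True).clamp(min=1.0)
    loss_r2w = -(targets_r2w * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

    # Word-to-region direction, symmetrically.
    align_t = alignment.t()
    targets_w2r = align_t / align_t.sum(dim=1, keepdim=True).clamp(min=1.0)
    loss_w2r = -(targets_w2r * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()

    return 0.5 * (loss_r2w + loss_w2r)
```

In this sketch the negatives for each region are simply the other words in the same caption; a batch-level variant would instead contrast each region against words drawn from all image-text pairs in the batch, which greatly enlarges the pool of negatives.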
