RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, scaling RLIPv1 is challenging because of the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a specially designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot, and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
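
As a rough illustration of what "gated cross-modal fusion" can mean in general, the sketch below lets vision features attend to language features and adds the result back through a learnable scalar gate. It is not the ALIF implementation: the function names (`cross_attend`, `gated_fusion`), the single-head attention without projection matrices, the tanh gate, and the one-directional update are all assumptions made for this example, and the abstract does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention (hypothetical helper).

    Keys double as values and no learned projections are used,
    purely to keep the sketch short.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return attended


def gated_fusion(vision, language, gate):
    """Residual update: vision + tanh(gate) * attention(vision -> language).

    The scalar tanh gate is an assumption for illustration; with
    gate = 0.0 the vision features pass through unchanged.
    """
    attended = cross_attend(vision, language)
    g = math.tanh(gate)
    return [
        [v + g * a for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(vision, attended)
    ]
```

With `gate = 0.0` the fusion is a no-op, so a model built this way would start from its unimodal behaviour and learn how much language signal to mix in; whether RLIPv2 initialises or parameterises its gates this way is not stated in the abstract.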
