RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Samuel Albanie | Jianwen Jiang | Deli Zhao | Xiang Wang | Yingya Zhang | Shiwei Zhang | Hangjie Yuan | D. Ni | Tao Feng | Yining Pan
[1] Jingren Zhou,et al. VideoComposer: Compositional Video Synthesis with Motion Controllability , 2023, NeurIPS.
[2] B. Ommer,et al. SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis , 2023, 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).
[3] B. Rosenhahn,et al. Learning Similarity between Scene Graphs and Images with Transformers , 2023, ArXiv.
[4] Florian Schroff,et al. Unified Visual Relationship Detection with Vision and Language Models , 2023, ArXiv.
[5] Hanie Sedghi,et al. The Role of Pre-training Data in Transfer Learning , 2023, ArXiv.
[6] S. Savarese,et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , 2023, ICML.
[7] Gabriel Ilharco,et al. Reproducible Scaling Laws for Contrastive Language-Image Learning , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Hang Li,et al. X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks , 2022, ArXiv.
[9] Ming-Hsuan Yang,et al. Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training , 2022, ArXiv.
[10] W. Zhang,et al. DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection , 2022, NeurIPS.
[11] Samuel Albanie,et al. RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection , 2022, NeurIPS.
[12] Li Dong,et al. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks , 2022, ArXiv.
[13] Cewu Lu,et al. Mining Cross-Person Cues for Body-Part Interactiveness Learning in HOI Detection , 2022, ECCV.
[14] Yann LeCun,et al. Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone , 2022, NeurIPS.
[15] Liunian Harold Li,et al. GLIPv2: Unifying Localization and Vision-Language Understanding , 2022, 2206.05836.
[16] Ting Yao,et al. Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Changxing Ding,et al. Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Oriol Vinyals,et al. Flamingo: a Visual Language Model for Few-Shot Learning , 2022, NeurIPS.
[19] Errui Ding,et al. Human-Object Interaction Detection via Disentangled Transformer , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Chi-Keung Tang,et al. Interactiveness Field in Human-Object Interactions , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Z. Tu,et al. X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks , 2022, ECCV.
[22] Hyunwoo J. Kim,et al. Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Jonghwan Mun,et al. MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Xiaobo Li,et al. GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] L. Ni,et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Hangjie Yuan,et al. Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics , 2022, AAAI.
[27] Hang Su,et al. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR , 2022, ICLR.
[28] S. Hoi,et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , 2022, ICML.
[29] B. Rosenhahn,et al. RelTR: Relation Transformer for Scene Graph Generation , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[30] Xuming He,et al. SGTR: End-to-end Scene Graph Generation with Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Keiji Yanai,et al. QAHOI: Query-Based Anchors for Human-Object Interaction Detection , 2021, 2023 18th International Conference on Machine Vision and Applications (MVA).
[32] Liunian Harold Li,et al. Grounded Language-Image Pre-training , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Frederic Z. Zhang,et al. Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Daniel Keysers,et al. LiT: Zero-Shot Transfer with Locked-image text Tuning , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Zi-Yi Dou,et al. An Empirical Study of Training End-to-End Vision-and-Language Transformers , 2021, Computer Vision and Pattern Recognition.
[36] Jianwei Yang,et al. Learning to Generate Scene Graph from Natural Language Supervision , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[37] Hangjie Yuan,et al. Spatio-Temporal Dynamic Inference Network for Group Activity Recognition , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[38] Adams Wei Yu,et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision , 2021, ICLR.
[39] Chen Gao,et al. Mining the Benefits of Two-stage and One-stage HOI Detection , 2021, NeurIPS.
[40] Junnan Li,et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation , 2021, NeurIPS.
[41] Lu Yuan,et al. Dynamic Head: Unifying Object Detection Heads with Attentions , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Hangjie Yuan,et al. Learning Visual Context for Group Activity Recognition , 2021, AAAI.
[43] Eun-Sol Kim,et al. HOTR: End-to-End Human-Object Interaction Detection with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Yann LeCun,et al. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[45] D. Tao,et al. Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Y. Qiao,et al. Affordance Transfer Learning for Human-Object Interaction Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Xuming He,et al. Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Masood S. Mortazavi,et al. Fully Convolutional Scene Graph Generation , 2021, Computer Vision and Pattern Recognition.
[49] Dacheng Tao,et al. Detecting Human-Object Interaction via Fabricated Compositional Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] C. Qian,et al. Reformulating HOI Detection as Adaptive Set Prediction , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Tomoaki Yoshinaga,et al. QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Jian Sun,et al. End-to-End Human Object Interaction Detection with HOI Transformer , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[54] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[55] Cewu Lu,et al. HOI Analysis: Integrating and Decomposing Human-Object Interaction , 2020, NeurIPS.
[56] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[57] Bin Li,et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.
[58] Chen Gao,et al. DRG: Dual Relation Graph for Human-Object Interaction Detection , 2020, ECCV.
[59] Jaewoo Kang,et al. UnionDet: Union-Level Detector Towards Real-Time Human-Object Interaction Detection , 2020, ECCV.
[60] Y. Qiao,et al. Visual Compositional Learning for Human-Object Interaction Detection , 2020, ECCV.
[61] Quoc V. Le,et al. Rethinking Pre-training and Self-training , 2020, NeurIPS.
[62] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.
[63] Jinquan Zeng,et al. GPS-Net: Graph Property Sensing Network for Scene Graph Generation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[64] Jianqiang Huang,et al. Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[65] Jiashi Feng,et al. PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[66] Juan Carlos Niebles,et al. Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[67] Xilin Chen,et al. Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[68] Jian Sun,et al. Objects365: A Large-Scale, High-Quality Dataset for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[69] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[70] Yejin Choi,et al. The Curious Case of Neural Text Degeneration , 2019, ICLR.
[71] Liang Lin,et al. Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[72] Ji Zhang,et al. Graphical Contrastive Losses for Scene Graph Parsing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[73] Silvio Savarese,et al. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Wei Liu,et al. Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[75] Kaiming He,et al. Rethinking ImageNet Pre-Training , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[76] Cewu Lu,et al. Transferable Interactiveness Knowledge for Human-Object Interaction Detection , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[77] Jordi Pont-Tuset,et al. The Open Images Dataset V4 , 2018, International Journal of Computer Vision.
[78] Tao Mei,et al. Exploring Visual Relationship for Image Captioning , 2018, ECCV.
[79] Song-Chun Zhu,et al. Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.
[80] Stefan Lee,et al. Graph R-CNN for Scene Graph Generation , 2018, ECCV.
[81] Chen Gao,et al. iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection , 2018, BMVC.
[82] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[83] Xiaogang Wang,et al. Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation , 2018, ECCV.
[84] Yejin Choi,et al. Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[85] Gang Sun,et al. Squeeze-and-Excitation Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[86] Kaiming He,et al. Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[87] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[88] Kaiming He,et al. Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[89] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.
[90] Jia Deng,et al. Learning to Detect Human-Object Interactions , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).
[91] Danfei Xu,et al. Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[92] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[93] Jia Deng,et al. Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.
[94] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[95] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[96] Jiaxuan Wang,et al. HICO: A Benchmark for Recognizing Human-Object Interactions in Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[97] Li Fei-Fei,et al. Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval , 2015, VL@EMNLP.
[98] Michael S. Bernstein,et al. Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[99] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[100] Jitendra Malik,et al. Visual Semantic Role Labeling , 2015, ArXiv.
[101] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[102] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[103] Hang Li,et al. X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks , 2022, ArXiv.
[104] Rao Muhammad Anwer,et al. Multi-modal Transformers Excel at Class-agnostic Object Detection , 2021, ArXiv.
[105] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[106] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.