Paper information - Single-Stage Visual Relationship Learning using Conditional Queries

Single-Stage Visual Relationship Learning using Conditional Queries

Research in scene graph generation (SGG) usually considers two-stage models that first detect a set of entities, then combine them and label all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR [3], a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely TraCQ, with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable of end-to-end training and faster inference.
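To make the "combinatorial entity pair" point concrete, here is a minimal sketch (not code from the paper) contrasting the number of candidates a two-stage pipeline has to label with the fixed budget of a query-based single-stage model. The function names, the entity list, and the query count of 100 are all hypothetical, chosen only for illustration.

```python
from itertools import permutations


def two_stage_candidates(entities):
    """Enumerate every ordered (subject, object) pair of detected entities.

    A two-stage pipeline that labels all possible relationships has to
    score each of these pairs, so the count grows as N * (N - 1).
    """
    return list(permutations(entities, 2))


def single_stage_candidates(num_queries):
    """Return the number of triplet slots of a query-based model.

    Assumption for illustration: one predicted triplet per query, so the
    budget is fixed and independent of how many entities are in the image.
    """
    return num_queries


# Hypothetical detections for one image.
entities = ["person", "horse", "hat", "fence"]
pairs = two_stage_candidates(entities)

print(len(pairs))                    # 4 * 3 ordered pairs
print(single_stage_candidates(100))  # fixed, whatever the entity count
```

With 4 entities the two-stage candidate set has 12 ordered pairs; with 50 entities it would have 2,450, while the query budget stays constant.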

N. Vasconcelos | Subarna Tripathi | Tz-Ying Wu | Alakh Desai

[1] B. Rosenhahn, et al. RelTR: Relation Transformer for Scene Graph Generation, 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] N. Vasconcelos, et al. YORO - Lightweight End to End Visual Grounding, 2022, ECCV Workshops.

[3] Bjoern H Menze, et al. Relationformer: A Unified Framework for Image-to-Graph Generation, 2022, ECCV.

[4] Xuming He, et al. SGTR: End-to-end Scene Graph Generation with Transformer, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] X. Zhang, et al. MOTR: End-to-End Multiple-Object Tracking with TRansformer, 2021, ECCV.

[6] Yu-Gang Jiang, et al. Scene Graph Refinement Network for Visual Question Answering, 2022, IEEE Transactions on Multimedia.

[7] Graham W. Taylor, et al. Context-aware Scene Graph Generation with Seq2Seq Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[8] Shuicheng Yan, et al. PnP-DETR: Towards Efficient Visual Analysis with Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9] Nuno Vasconcelos, et al. Learning of Visual Relations: The Devil is in the Tails, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[10] Ling Shao, et al. ISTR: End-to-End Instance Segmentation with Transformers, 2021, ArXiv.

[11] Eun-Sol Kim, et al. HOTR: End-to-End Human-Object Interaction Detection with Transformers, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Xuming He, et al. Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Masood S. Mortazavi, et al. Fully Convolutional Scene Graph Generation, 2021, Computer Vision and Pattern Recognition.

[14] Tanaya Guha, et al. In Defense of Scene Graphs for Image Captioning, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[15] Kenneth Ward Church, et al. Exploring Long Tail Visual Relationship Recognition with Large Vocabulary, 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16] Xian-Sheng Hua, et al. PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation, 2020, ACM Multimedia.

[17] Jorma Laaksonen, et al. Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models, 2020, BMVC.

[18] Hsin-Ying Lee, et al. RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval, 2020, ECCV.

[19] Nicolas Usunier, et al. End-to-End Object Detection with Transformers, 2020, ECCV.

[20] Volker Tresp, et al. Relation Transformer Network, 2020, ArXiv.

[21] Jinquan Zeng, et al. GPS-Net: Graph Property Sensing Network for Scene Graph Generation, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Jianqiang Huang, et al. Unbiased Scene Graph Generation From Biased Training, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Shih-Fu Chang, et al. Bridging Knowledge Graphs to Generate Scene Graphs, 2020, ECCV.

[24] Xilin Chen, et al. Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval, 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[25] Petros Maragos, et al. Attention-Translation-Relation Network for Scalable Scene Graph Generation, 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[26] Shiguang Shan, et al. Exploring Context and Visual Pattern of Relationship for Scene Graph Generation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Xiaogang Wang, et al. PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph, 2019, NeurIPS.

[28] Jianfei Cai, et al. Scene Graph Generation With External Knowledge and Image Reconstruction, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Liang Lin, et al. Knowledge-Embedded Routing Network for Scene Graph Generation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Silvio Savarese, et al. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Jianfei Cai, et al. Auto-Encoding Scene Graphs for Image Captioning, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Wei Liu, et al. Learning to Compose Dynamic Tree Structures for Visual Contexts, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.

[35] In-So Kweon, et al. LinkNet: Relational Embedding for Scene Graph, 2018, NeurIPS.

[36] Tao Mei, et al. Exploring Visual Relationship for Image Captioning, 2018, ECCV.

[37] Stefan Lee, et al. Graph R-CNN for Scene Graph Generation, 2018, ECCV.

[38] Nenghai Yu, et al. Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition, 2018, ECCV.

[39] Xiaogang Wang, et al. Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation, 2018, ECCV.

[40] Li Fei-Fei, et al. Image Generation from Scene Graphs, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41] Yejin Choi, et al. Neural Motifs: Scene Graph Parsing with Global Context, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] Xiaogang Wang, et al. Scene Graph Generation from Objects, Phrases and Region Captions, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[43] Larry S. Davis, et al. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[45] Danfei Xu, et al. Scene Graph Generation by Iterative Message Passing, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.

[47] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Michael S. Bernstein, et al. Image retrieval using scene graphs, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[51] Harold W. Kuhn, et al. The Hungarian method for the assignment problem, 1955, 50 Years of Integer Programming.