Single-Stage Visual Relationship Learning using Conditional Queries

Research in scene graph generation (SGG) usually considers two-stage models, which first detect a set of entities and then combine them and label all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR [19], a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely TraCQ, with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable of end-to-end training and offering faster inference.

[1] B. Rosenhahn, et al. RelTR: Relation Transformer for Scene Graph Generation, 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] N. Vasconcelos, et al. YORO - Lightweight End to End Visual Grounding, 2022, ECCV Workshops.

[3] Bjoern H Menze, et al. Relationformer: A Unified Framework for Image-to-Graph Generation, 2022, ECCV.

[4] Xuming He, et al. SGTR: End-to-end Scene Graph Generation with Transformer, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] X. Zhang, et al. MOTR: End-to-End Multiple-Object Tracking with TRansformer, 2021, ECCV.

[6] Yu-Gang Jiang, et al. Scene Graph Refinement Network for Visual Question Answering, 2022, IEEE Transactions on Multimedia.

[7] Graham W. Taylor, et al. Context-aware Scene Graph Generation with Seq2Seq Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[8] Shuicheng Yan, et al. PnP-DETR: Towards Efficient Visual Analysis with Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9] Nuno Vasconcelos, et al. Learning of Visual Relations: The Devil is in the Tails, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[10] Ling Shao, et al. ISTR: End-to-End Instance Segmentation with Transformers, 2021, ArXiv.

[11] Eun-Sol Kim, et al. HOTR: End-to-End Human-Object Interaction Detection with Transformers, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Xuming He, et al. Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Masood S. Mortazavi, et al. Fully Convolutional Scene Graph Generation, 2021, Computer Vision and Pattern Recognition.

[14] Tanaya Guha, et al. In Defense of Scene Graphs for Image Captioning, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[15] Kenneth Ward Church, et al. Exploring Long Tail Visual Relationship Recognition with Large Vocabulary, 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16] Xian-Sheng Hua, et al. PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation, 2020, ACM Multimedia.

[17] Jorma Laaksonen, et al. Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models, 2020, BMVC.

[18] Hsin-Ying Lee, et al. RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval, 2020, ECCV.

[19] Nicolas Usunier, et al. End-to-End Object Detection with Transformers, 2020, ECCV.

[20] Volker Tresp, et al. Relation Transformer Network, 2020, ArXiv.

[21] Jinquan Zeng, et al. GPS-Net: Graph Property Sensing Network for Scene Graph Generation, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Jianqiang Huang, et al. Unbiased Scene Graph Generation From Biased Training, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Shih-Fu Chang, et al. Bridging Knowledge Graphs to Generate Scene Graphs, 2020, ECCV.

[24] Xilin Chen, et al. Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval, 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[25] Petros Maragos, et al. Attention-Translation-Relation Network for Scalable Scene Graph Generation, 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[26] Shiguang Shan, et al. Exploring Context and Visual Pattern of Relationship for Scene Graph Generation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Xiaogang Wang, et al. PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph, 2019, NeurIPS.

[28] Jianfei Cai, et al. Scene Graph Generation With External Knowledge and Image Reconstruction, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Liang Lin, et al. Knowledge-Embedded Routing Network for Scene Graph Generation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Silvio Savarese, et al. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Jianfei Cai, et al. Auto-Encoding Scene Graphs for Image Captioning, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Wei Liu, et al. Learning to Compose Dynamic Tree Structures for Visual Contexts, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.

[35] In-So Kweon, et al. LinkNet: Relational Embedding for Scene Graph, 2018, NeurIPS.

[36] Tao Mei, et al. Exploring Visual Relationship for Image Captioning, 2018, ECCV.

[37] Stefan Lee, et al. Graph R-CNN for Scene Graph Generation, 2018, ECCV.

[38] Nenghai Yu, et al. Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition, 2018, ECCV.

[39] Xiaogang Wang, et al. Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation, 2018, ECCV.

[40] Li Fei-Fei, et al. Image Generation from Scene Graphs, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41] Yejin Choi, et al. Neural Motifs: Scene Graph Parsing with Global Context, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] Xiaogang Wang, et al. Scene Graph Generation from Objects, Phrases and Region Captions, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[43] Larry S. Davis, et al. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[45] Danfei Xu, et al. Scene Graph Generation by Iterative Message Passing, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.

[47] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Michael S. Bernstein, et al. Image retrieval using scene graphs, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[51] Harold W. Kuhn, et al. The Hungarian method for the assignment problem, 1955, 50 Years of Integer Programming.