Context-aware Scene Graph Generation with Seq2Seq Transformers

Scene graph generation is an important task in computer vision aimed at improving the semantic understanding of the visual world. In this task, the model needs to detect objects and predict visual relationships between them. Most of the existing models predict relationships in parallel assuming their independence. While there are different ways to capture these dependencies, we explore a conditional approach motivated by the sequence-to-sequence (Seq2Seq) formalism. Different from the previous research, our proposed model predicts visual relationships one at a time in an autoregressive manner by explicitly conditioning on the already predicted relationships. Drawing from translation models in NLP, we propose an encoder-decoder model built using Transformers where the encoder captures global context and long range interactions. The decoder then makes sequential predictions by conditioning on the scene graph constructed so far. In addition, we introduce a novel reinforcement learning-based training strategy tailored to Seq2Seq scene graph generation. By using a self-critical policy gradient training approach with Monte Carlo search we directly optimize for the (mean) recall metrics and bridge the gap between training and evaluation. Experimental results on two public benchmark datasets demonstrate that our Seq2Seq learning approach achieves strong empirical performance, outperforming previous state-of-the-art, while remaining efficient in terms of training and inference time. Full code for this work is available here: https://github.com/layer6ai-labs/SGG-Seq2Seq.

[1]  Roger Zimmermann,et al.  Recovering the Unbiased Scene Graphs from the Biased Ones , 2021, ACM Multimedia.

[2]  Leonid Sigal,et al.  Segmentation-grounded Scene Graph Generation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Masood S. Mortazavi,et al.  Fully Convolutional Scene Graph Generation , 2021, Computer Vision and Pattern Recognition.

[4]  L. Sigal,et al.  Energy-Based Learning for Scene Graph Generation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Ivan Laptev,et al.  Training Vision Transformers for Image Retrieval , 2021, ArXiv.

[6]  Parham Aarabi,et al.  SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Chenhui Chu,et al.  Understanding the Role of Scene Graphs in Visual Question Answering , 2021, ArXiv.

[8]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[9]  Xian-Sheng Hua,et al.  PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation , 2020, ACM Multimedia.

[10]  Xilin Chen,et al.  Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation , 2020, ECCV.

[11]  Aaron C. Courville,et al.  Generative Compositional Augmentations for Scene Graph Prediction , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Stephan Günnemann,et al.  Scene Graph Reasoning for Visual Question Answering , 2020, ArXiv.

[13]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[14]  Aaron C. Courville,et al.  Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation , 2020, BMVC.

[15]  Ayush Mangal,et al.  Visual Relationship Detection using Scene Graphs: A Survey , 2020, ArXiv.

[16]  Pengfei Xu,et al.  A Survey of Scene Graph: Generation and Application , 2020 .

[17]  Svetlana Lazebnik,et al.  Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Volker Tresp,et al.  Relation Transformer Network , 2020, ArXiv.

[19]  Oliver Schulte,et al.  Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation , 2020, AAAI.

[20]  Jinquan Zeng,et al.  GPS-Net: Graph Property Sensing Network for Scene Graph Generation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Shih-Fu Chang,et al.  Bridging Knowledge Graphs to Generate Scene Graphs , 2020, ECCV.

[23]  Y. Lu,et al.  Learning Effective Visual Relationship Detector on 1 GPU , 2019, ArXiv.

[24]  Ju-Whan Kim,et al.  Visual Question Answering over Scene Graph , 2019, 2019 First International Conference on Graph Computing (GC).

[25]  Cheng Zhang,et al.  An Empirical Study on Leveraging Scene Graphs for Visual Question Answering , 2019, BMVC.

[26]  Christopher D. Manning,et al.  Learning by Abstraction: The Neural State Machine , 2019, NeurIPS.

[27]  Yang Zhang,et al.  PANet: A Context Based Predicate Association Network for Scene Graph Generation , 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).

[28]  Ashish Vaswani,et al.  Stand-Alone Self-Attention in Vision Models , 2019, NeurIPS.

[29]  Shiguang Shan,et al.  Exploring Context and Visual Pattern of Relationship for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  M. Võ,et al.  Reading scenes: how scene grammar guides attention and aids perception in real-world environments. , 2019, Current opinion in psychology.

[31]  Gang Wang,et al.  Unpaired Image Captioning via Scene Graph Alignments , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Liang Lin,et al.  Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Ji Zhang,et al.  Graphical Contrastive Losses for Scene Graph Parsing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Long Chen,et al.  Counterfactual Critic Multi-Agent Training for Scene Graph Generation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Jianfei Cai,et al.  Auto-Encoding Scene Graphs for Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Juan-Zi Li,et al.  Explainable and Explicit Visual Reasoning Over Scene Graphs , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Wei Liu,et al.  Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Weijian Li,et al.  Attentive Relational Networks for Mapping Images to Scene Graphs , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[40]  Nenghai Yu,et al.  Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition , 2018, ECCV.

[41]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[43]  Ian D. Reid,et al.  Towards Context-Aware Interaction Recognition for Visual Relationship Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44]  Shih-Fu Chang,et al.  PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[45]  Larry S. Davis,et al.  Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[46]  Jia Deng,et al.  Pixels to Graphs by Associative Embedding , 2017, NIPS.

[47]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[48]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Eric P. Xing,et al.  Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Xiaogang Wang,et al.  ViP-CNN: Visual Phrase Guided Convolutional Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Matthew B. Blaschko,et al.  Joint Embeddings of Scene Graphs and Images , 2017, ICLR.

[53]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Vaibhava Goel,et al.  Self-Critical Sequence Training for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[56]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[57]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[58]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[59]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[62]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[63]  Ronald J. Williams,et al.  A Learning Algorithm for Continually Running Fully Recurrent Neural Networks , 1989, Neural Computation.