Fully Convolutional Scene Graph Generation

This paper presents a fully convolutional scene graph generation (FCSGG) model that detects objects and relations simultaneously. Most of the scene graph generation frameworks use a pre-trained two-stage object detector, like Faster R-CNN, and build scene graphs using bounding box features. Such pipeline usually has a large number of parameters and low inference speed. Unlike these approaches, FCSGG is a conceptually elegant and efficient bottom-up approach that encodes objects as bounding box center points, and relationships as 2D vector fields which are named as Relation Affinity Fields (RAFs). RAFs encode both semantic and spatial features, and explicitly represent the relationship between a pair of objects by the integral on a sub-region that points from subject to object. FCSGG only utilizes visual features and still generates strong results for scene graph generation. Comprehensive experiments on the Visual Genome dataset demonstrate the efficacy, efficiency, and generalizability of the proposed method. FCSGG achieves highly competitive results on recall and zeroshot recall with significantly reduced inference time.

[1]  Masood S. Mortazavi 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020 , 2020, INTERSPEECH.

[2]  Haoxuan You,et al.  Learning Visual Commonsense for Robust Scene Graph Generation , 2020, ECCV.

[3]  Aaron C. Courville,et al.  Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation , 2020, BMVC.

[4]  Svetlana Lazebnik,et al.  Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Jinquan Zeng,et al.  GPS-Net: Graph Property Sensing Network for Scene Graph Generation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Shih-Fu Chang,et al.  Bridging Knowledge Graphs to Generate Scene Graphs , 2020, ECCV.

[8]  James R. Glass,et al.  Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech , 2019, ICLR.

[9]  Quoc V. Le,et al.  EfficientDet: Scalable and Efficient Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Petros Maragos,et al.  Attention-Translation-Relation Network for Scalable Scene Graph Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[11]  Yang Zhao,et al.  Deep High-Resolution Representation Learning for Visual Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Michael S. Bernstein,et al.  Visual Relationships as Functions:Enabling Few-Shot Scene Graph Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[13]  Shiguang Shan,et al.  Exploring Context and Visual Pattern of Relationship for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Stephen Lin,et al.  RepPoints: Point Set Representation for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Michael S. Bernstein,et al.  Scene Graph Prediction with Limited Labels , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[16]  Xingyi Zhou,et al.  Objects as Points , 2019, ArXiv.

[17]  Wei-Ying Ma,et al.  Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Hao Chen,et al.  FCOS: Fully Convolutional One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Jianfei Cai,et al.  Scene Graph Generation With External Knowledge and Image Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Liang Lin,et al.  Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Christopher D. Manning,et al.  GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Long Chen,et al.  Counterfactual Critic Multi-Agent Training for Scene Graph Generation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Jianfei Cai,et al.  Auto-Encoding Scene Graphs for Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Wei Liu,et al.  Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  In-So Kweon,et al.  LinkNet: Relational Embedding for Scene Graph , 2018, NeurIPS.

[27]  Tao Mei,et al.  Exploring Visual Relationship for Image Captioning , 2018, ECCV.

[28]  Hei Law,et al.  CornerNet: Detecting Objects as Paired Keypoints , 2018, International Journal of Computer Vision.

[29]  Jianfei Cai,et al.  Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features , 2018, ECCV.

[30]  Xiaogang Wang,et al.  Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation , 2018, ECCV.

[31]  Li Fei-Fei,et al.  Image Generation from Scene Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Kaiming He,et al.  Group Normalization , 2018, International Journal of Computer Vision.

[33]  S. Sontag On Photography , 2017, Illuminations.

[34]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Xiaogang Wang,et al.  Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Larry S. Davis,et al.  Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Jia Deng,et al.  Pixels to Graphs by Associative Embedding , 2017, NIPS.

[40]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[42]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Xiaogang Wang,et al.  ViP-CNN: Visual Phrase Guided Convolutional Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Serge J. Belongie,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Svetlana Lazebnik,et al.  Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Anton van den Hengel,et al.  Graph-Structured Representations for Visual Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[49]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[53]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[54]  Christian Szegedy,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[55]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[57]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[58]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[59]  Richard Szeliski,et al.  A Database and Evaluation Methodology for Optical Flow , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[60]  John Field Language and the Mind , 2005 .

[61]  G. Ryle,et al.  The concept of mind. , 2004, The International journal of psycho-analysis.

[62]  John D. Benjamin Ways of Seeing , 2004, Before Photography.

[63]  S. Hochreiter,et al.  Long Short-Term Memory , 1997, Neural Computation.

[64]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[65]  Petros Maragos,et al.  From Saturation to Zero-Shot Visual Relationship Detection Using Local Context , 2020, BMVC.

[66]  Ah Chung Tsoi,et al.  The Graph Neural Network Model , 2009, IEEE Transactions on Neural Networks.