Scene Graph Inference via Multi-Scale Context Modeling

A scene graph structurally represents the object interactions in an image and substantially aids scene understanding. To the best of our knowledge, most current work on scene graph generation focuses on pairwise object regions for object and relation inference and ignores the global visual context outside these regions. Guided by the intuition that object and relation inference can benefit from the visual context within an image, this paper proposes a multi-scale context modeling method that jointly discovers and integrates complementary object-centric and region-centric context for scene graph inference. The object-centric and region-centric contexts are modeled by separate modules, and a bi-directional message propagation strategy lets the two mutually reinforce each other. A context-fused inference step then integrates the multi-scale context to guide scene graph inference. Extensive experiments show that the method achieves performance competitive with state-of-the-art methods on three benchmarks, and additional ablation studies further validate its effectiveness. Code has been made available at: https://github.com/ningxu1990/MSCM.
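
The bi-directional propagation and context fusion described above can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the modules, so plain averaging and linear mixing stand in for the learned components, and every name here (`propagate`, `fuse`, `alpha`, the dictionary layout) is a hypothetical choice made for illustration only.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Pair = Tuple[int, int]  # (subject index, object index); hypothetical layout


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mix(own: Vector, message: Vector, alpha: float) -> Vector:
    """Linear blend of a context vector with an incoming message (assumed update rule)."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(own, message)]


def propagate(
    obj_ctx: Dict[int, Vector],
    reg_ctx: Dict[Pair, Vector],
    alpha: float = 0.5,
    steps: int = 1,
) -> Tuple[Dict[int, Vector], Dict[Pair, Vector]]:
    """Exchange messages in both directions between object and region contexts."""
    for _ in range(steps):
        # Region -> object: each object averages the regions it takes part in.
        new_obj = {}
        for i, vec in obj_ctx.items():
            incoming = [r for pair, r in reg_ctx.items() if i in pair]
            new_obj[i] = mix(vec, mean(incoming), alpha) if incoming else list(vec)
        # Object -> region: each region averages its two member objects.
        # Both directions read the pre-update values (synchronous update).
        new_reg = {
            (s, o): mix(vec, mean([obj_ctx[s], obj_ctx[o]]), alpha)
            for (s, o), vec in reg_ctx.items()
        }
        obj_ctx, reg_ctx = new_obj, new_reg
    return obj_ctx, reg_ctx


def fuse(obj_ctx: Dict[int, Vector], reg_ctx: Dict[Pair, Vector], pair: Pair) -> Vector:
    """Concatenate subject, region, and object contexts as input to a relation classifier."""
    s, o = pair
    return obj_ctx[s] + reg_ctx[pair] + obj_ctx[o]


objects = {0: [1.0, 0.0], 1: [0.0, 1.0]}
regions = {(0, 1): [2.0, 2.0]}
objects_out, regions_out = propagate(objects, regions, alpha=0.5, steps=1)
fused = fuse(objects_out, regions_out, (0, 1))
```

In this sketch the fused vector for a pair would feed a relation classifier, while the updated object contexts would feed an object classifier; the real method learns these updates rather than using fixed averages.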
