Learning to transfer focus of graph neural network for scene graph parsing

Abstract: Scene graph parsing has recently emerged as a challenge in image understanding and pattern recognition. It captures the objects in a scene and the relationships between them, providing a structured representation of the visual scene. Of the three types of high-level relationships in scene graphs, semantic relationships carry a global understanding of the scene and are therefore the most valuable, whereas geometric and possessive relationships carry only local, limited information. Semantic relationships, however, span many types with few instances each, so existing detectors recognize most of them poorly. To address this issue, this paper proposes a new architecture, the graphical focal network, which uses a decision-level global detector to capture the dependencies between the local object and relationship detectors. We construct a graphical focal loss that compensates for the scarcity of semantic relationship instances by adjusting the proportion of the relationship loss according to each relationship's rarity and learning difficulty, and that improves the stability of key object recognition by adjusting the proportion of the object loss according to node connectivity and the value of neighborhood relationships. The proposed relative depth encoding module and regional layout encoding module introduce, respectively, relative depth information and more effective geometric layout information between objects, further improving performance. Experiments on the Visual Genome benchmark show that our method outperforms state-of-the-art competitors on two types of performance metrics. For semantic relationship types, the recognition rate of our method is 2.0 times that of the baseline.
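
The abstract does not give the formula for the graphical focal loss, so the following is only a rough sketch of the general idea of scaling a relationship loss by rarity and learning difficulty. It assumes a standard focal-loss difficulty term, (1 - p_t)^gamma, and an inverse-frequency rarity weight; the function name `rarity_focal_loss`, the `class_counts` argument, and the normalization are hypothetical choices for illustration, not the authors' definition. The node-connectivity weighting of the object loss is not modeled here.

```python
import numpy as np

def rarity_focal_loss(probs, labels, class_counts, gamma=2.0):
    """Mean focal loss with inverse-frequency class weights (illustrative only).

    probs:        (N, C) predicted class probabilities, each row sums to 1.
    labels:       (N,) integer ground-truth class indices.
    class_counts: (C,) number of training instances per class.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    counts = np.asarray(class_counts, dtype=float)

    # Rarity weight (assumption): inverse class frequency, normalized to mean 1.
    inv = 1.0 / counts
    rarity = inv / inv.mean()

    # Probability the model assigns to the true class of each sample.
    p_t = probs[np.arange(len(labels)), labels]
    p_t = np.clip(p_t, 1e-12, 1.0)

    # Difficulty term: (1 - p_t)^gamma shrinks the loss of easy samples.
    difficulty = (1.0 - p_t) ** gamma

    # Weighted negative log-likelihood, averaged over samples.
    return float(np.mean(rarity[labels] * difficulty * -np.log(p_t)))
```

With `gamma=0` and equal class counts this reduces to ordinary cross-entropy. With unequal counts, a misclassified instance of a rare relationship contributes more to the loss than an equally confident mistake on a common one, which is the behavior the abstract attributes to the rarity adjustment.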
