论文信息 - Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation

Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation

We propose a new algorithm, called Deep Generative Probabilistic Graph Neural Networks (DG-PGNN), to generate a scene graph for an image. The input to DG-PGNN is an image, together with a set of region-grounded captions and object bounding-box proposals for the image. To generate the scene graph, DG-PGNN constructs and updates a new model, called a Probabilistic Graph Network (PGN). A PGN can be thought of as a scene graph with uncertainty: it represents each node and each edge by a CNN feature vector and defines a probability mass function (PMF) for node-type (object category) of each node and edge-type (predicate class) of each edge. The DG-PGNN sequentially adds a new node to the current PGN by learning the optimal ordering in a Deep Q-learning framework, where states are partial PGNs, actions choose a new node, and rewards are defined based on the ground-truth. After adding a node, DG-PGNN uses message passing to update the feature vectors of the current PGN by leveraging contextual relationship information, object co-occurrences, and language priors from captions. The updated features are then used to fine-tune the PMFs. Our experiments show that the proposed algorithm significantly outperforms the state-of-the-art results on the Visual Genome dataset for scene graph generation. We also show that the scene graphs constructed by DG-PGNN improve performance on the visual question answering task, for questions that need reasoning about objects and their interactions in the scene context.

Oliver Schulte | Mahmoud Khademi | O. Schulte | M. Khademi

[1] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[2] Yu Zhang,et al. Highway long short-term memory RNNS for distant speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Razvan Pascanu,et al. Relational inductive biases, deep learning, and graph networks , 2018, ArXiv.

[4] Danfei Xu,et al. Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Michael S. Bernstein,et al. Visual Relationship Detection with Language Priors , 2016, ECCV.

[6] Yejin Choi,et al. Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7] Jia Deng,et al. Pixels to Graphs by Associative Embedding , 2017, NIPS.

[8] Xiaogang Wang,et al. Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9] Wei Liu,et al. Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Stefan Lee,et al. Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[11] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.

[12] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.