Mask and Predict: Multi-step Reasoning for Scene Graph Generation

Scene Graph Generation (SGG) aims to parse an image into a set of semantics, comprising objects and their relations. Current SGG methods stop at presenting intuitive detections, such as the triplet "logo on board". Humans, by contrast, can further refine such intuitive detections into rational descriptions like "flower painted on surfboard". However, most existing methods formulate SGG as a straightforward single-pass task that predicts all semantics in one shot. To address this limitation, we propose a novel multi-step reasoning approach for SGG. Concretely, we break SGG into two explicit learning stages: an intuitive training stage (ITS) and a rational training stage (RTS). In the first stage, we follow the traditional SGG pipeline to detect objects and relationships, yielding an intuitive scene graph. In the second stage, we perform multi-step reasoning to refine the intuitive scene graph. Each reasoning step consists of two operations: mask and predict. Based on the primary predictions and their confidences, we repeatedly select and mask the low-confidence predictions, whose features are then optimized and predicted again. After several iterations, the intuitive semantics are gradually revised with high confidence, yielding a rational scene graph. Extensive experiments on Visual Genome demonstrate the superiority of the proposed method. Additional ablation studies and visualization cases further validate its effectiveness.
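
The mask-and-predict loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`mask_and_predict`, `toy_repredict`), the fixed `mask_ratio` selection rule, the number of steps, and the toy confidences are all assumptions made for illustration. In the actual method, re-prediction would come from a learned model operating on optimized features rather than from a lookup table.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (label, confidence)


def mask_and_predict(
    initial: List[Prediction],
    repredict: Callable[[List[Optional[str]], int], Prediction],
    num_steps: int = 3,
    mask_ratio: float = 0.5,
) -> List[Prediction]:
    """Iteratively mask the least confident predictions and predict them again.

    `repredict(context, index)` is a hypothetical stand-in for a learned model:
    it receives the current labels (None where masked) and the masked position.
    """
    current = list(initial)
    for _ in range(num_steps):
        # Assumption: mask a fixed fraction of the predictions at every step.
        n_mask = max(1, int(len(current) * mask_ratio))
        by_confidence = sorted(range(len(current)), key=lambda i: current[i][1])
        masked = sorted(by_confidence[:n_mask])
        # Unmasked labels stay visible as context for the re-prediction.
        context = [None if i in masked else current[i][0]
                   for i in range(len(current))]
        for i in masked:
            current[i] = repredict(context, i)
    return current


def toy_repredict(context: List[Optional[str]], index: int) -> Prediction:
    # Hypothetical refined outputs; a real model would condition on `context`.
    refined = {0: ("flower", 0.9), 1: ("painted on", 0.8), 2: ("surfboard", 0.95)}
    return refined[index]


intuitive = [("logo", 0.4), ("on", 0.3), ("board", 0.9)]
rational = mask_and_predict(intuitive, toy_repredict)
print(rational)
```

In this toy run, only the two low-confidence entries are masked and revised, while the already confident object label is left untouched. This mirrors the idea that refinement concentrates on uncertain predictions.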
