Improving Scene Graph Classification by Exploiting Knowledge from Texts

Training scene graph classification models requires a large amount of annotated image data. Meanwhile, scene graphs represent relational knowledge that can be modeled with symbolic data from texts or knowledge graphs. While image annotation demands extensive labor, collecting textual descriptions of natural scenes requires less effort. In this work, we investigate whether textual scene descriptions can substitute for annotated image data. To this end, we employ a scene graph classification framework that is trained not only from annotated images but also from symbolic data. In our architecture, the symbolic entities are first mapped to their corresponding image-grounded representations and then fed into the relational reasoning pipeline. Even though a structured form of knowledge, such as that found in knowledge graphs, is not always available, we can generate it from unstructured texts using a transformer-based language model. We show that by fine-tuning the classification pipeline with the knowledge extracted from texts, we can achieve ∼8x more accurate results in scene graph classification, ∼3x in object classification, and ∼1.5x in predicate classification, compared to the supervised baselines with only 1% of the annotated images.

Introduction

Relational reasoning is one of the essential components of intelligence; humans explore their environment by grasping the entire context of a scene rather than studying each item in isolation from the others. Furthermore, we expand our understanding of the world by educating ourselves about novel facts through reading or listening. For example, we might have never seen a “cow wearing a dress” but might have read about Hindu traditions of decorating cows.
While we already have a robust visual system that can extract basic visual features such as edges and curves from a scene, the description of a “cow wearing a dress” refines our visual understanding of relations on an object level and enables us to recognize a dressed cow when seeing it.

*These authors contributed equally. S. M. Baharlou contributed to this project while he was a visiting researcher at the Ludwig Maximilian University of Munich.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Relational reasoning is growing in popularity in the Computer Vision community, especially in the form of scene graph (SG) classification. The goal of SG classification is to classify objects and their relations in an image. One of the challenges in SG classification is collecting annotated image data. Most approaches in this domain rely on thousands of manually labeled and curated images. In this paper, we investigate whether SG classification models can be fine-tuned from textual scene descriptions (similar to the “dressed cow” example above).

We consider a classification pipeline with two major parts: a feature extraction backbone and a relational reasoning component (Figure 1). The backbone is typically a convolutional neural network (CNN) that detects objects and extracts an image-based representation for each. The relational reasoning component, on the other hand, can be a variant of a recurrent neural network [Xu et al. 2017, Zellers et al. 2018] or a graph convolutional network [Yang et al. 2018, Sharifzadeh, Baharlou, and Tresp 2021]. This component operates on an object level by taking the latent representations of all the objects in the image and propagating them in the graph. Note that, unlike the feature extraction backbone that requires images as input, the relational reasoning component operates on graphs with the nodes representing objects and the edges representing relations.
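To make the graph-level input of the relational reasoning component concrete, the following is a minimal sketch of one propagation round over object nodes. It is not the recurrent or graph convolutional model cited above: the function name `propagate`, the mean-aggregation rule, and the toy feature values are assumptions chosen for illustration, and the predicate labels are carried along but not used in the update.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[int, str, int]  # (head node index, predicate label, tail node index)


def propagate(nodes: List[Vector], edges: List[Edge]) -> List[Vector]:
    """One round of mean-aggregation message passing over object nodes.

    Each node's new representation is the average of its own vector and the
    vectors of every node it shares an edge with (in either direction).
    """
    neighbours: Dict[int, List[int]] = {i: [] for i in range(len(nodes))}
    for head, _predicate, tail in edges:
        neighbours[head].append(tail)
        neighbours[tail].append(head)

    updated: List[Vector] = []
    for i, own in enumerate(nodes):
        group = [own] + [nodes[j] for j in neighbours[i]]
        # Average the group dimension by dimension.
        updated.append([sum(vals) / len(group) for vals in zip(*group)])
    return updated


# Toy stand-ins for the backbone's per-object features (hypothetical values).
object_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # person, horse, hat
scene_edges = [(0, "rides", 1), (0, "wears", 2)]

contextualised = propagate(object_features, scene_edges)
```

The point of the sketch is that `propagate` never touches pixels: it only needs one vector per object and a list of edges, which is why this part of the pipeline can in principle be fed from sources other than images.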
The distinction between the input to the backbone (images) and the input to the relational reasoning component (graphs) is often overlooked. Instead, the scene graph classification pipeline is treated as a network that takes only images as inputs. However, one can also train or fine-tune the relational reasoning component directly by injecting relational knowledge into it. For example, Knowledge Graphs (KGs) contain curated facts that indicate the relations between a head object and a tail object in the form of (head, predicate, tail), e.g., (Person, Rides, Horse). The facts in KGs are represented by symbols, whereas the inputs to the relational reasoning component are image-based embeddings. In this work, we map the triples to image-grounded embeddings as if they came from an image. We then use these embeddings to fine-tune the relational reasoning component through a denoising graph autoencoder scheme. Note that factual knowledge is not always available in a well-structured form, especially in domains where the knowledge is not stored in the machine-accessible form of KGs. In fact, most of the collective human knowledge is only

arXiv:2102.04760v2 [cs.CV] 8 Oct 2021
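The idea of mapping symbolic triples to image-grounded embeddings and corrupting them for a denoising objective can be sketched as follows. This is a sketch under stated assumptions, not the paper's implementation: the `PROTOTYPES` table, the `MASK` vector, the helper names `triples_to_graph` and `corrupt`, and the node-masking noise are all hypothetical. For simplicity, each class name becomes a single node, so two distinct persons would be merged.

```python
import random
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, predicate, tail)
Vector = List[float]
Edge = Tuple[int, str, int]

# Hypothetical per-class prototype vectors standing in for image-grounded
# embeddings; in practice these would be derived from backbone features.
PROTOTYPES: Dict[str, Vector] = {
    "Person": [1.0, 0.0, 0.0],
    "Horse": [0.0, 1.0, 0.0],
}
MASK: Vector = [0.0, 0.0, 0.0]  # placeholder vector for a hidden node


def triples_to_graph(
    triples: List[Triple],
) -> Tuple[List[str], List[Vector], List[Edge]]:
    """Turn symbolic triples into node labels, node vectors, and indexed edges."""
    labels: List[str] = []
    for head, _pred, tail in triples:
        for name in (head, tail):
            if name not in labels:
                labels.append(name)
    vectors = [list(PROTOTYPES[name]) for name in labels]
    edges = [(labels.index(h), p, labels.index(t)) for h, p, t in triples]
    return labels, vectors, edges


def corrupt(vectors: List[Vector], drop_prob: float, rng: random.Random) -> List[Vector]:
    """Replace each node vector with MASK with probability drop_prob."""
    return [list(MASK) if rng.random() < drop_prob else list(v) for v in vectors]


labels, clean, edges = triples_to_graph([("Person", "Rides", "Horse")])
noisy = corrupt(clean, drop_prob=0.5, rng=random.Random(0))
# A denoising autoencoder would be trained to recover `labels` (and the
# predicates in `edges`) from `noisy`; the training loop is omitted here.
```

In this setup the training targets come entirely from the symbolic triples, while the inputs look like vectors the backbone could have produced, which is what lets the relational reasoning component be updated without additional annotated images.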

[1]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[2]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[3]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[4]  Xilin Chen,et al.  Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation , 2020, ECCV.

[5]  Volker Tresp,et al.  Improving Information Extraction from Images with Learned Semantic Models , 2018, IJCAI.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Long Chen,et al.  Counterfactual Critic Multi-Agent Training for Scene Graph Generation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Marti A. Hearst.  Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[9]  Shih-Fu Chang,et al.  Bridging Knowledge Graphs to Generate Scene Graphs , 2020, ECCV.

[10]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[11]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[12]  Thomas A. Runkler,et al.  Neural Relation Extraction within and across Sentence Boundaries , 2019, AAAI.

[13]  Volker Tresp,et al.  Improving Visual Relation Detection using Depth Maps , 2019, 2020 25th International Conference on Pattern Recognition (ICPR).

[14]  Yin Li,et al.  Compositional Learning for Human Object Interaction , 2018, ECCV.

[15]  Greg Mori,et al.  LabelBank: Revisiting Global Perspectives for Semantic Segmentation , 2017, ArXiv.

[16]  Volker Tresp,et al.  PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings , 2020, J. Mach. Learn. Res.

[17]  Yong Jae Lee,et al.  DOCK: Detecting Objects by Transferring Common-Sense Knowledge , 2018, ECCV.

[18]  Nancy Chinchor.  MUC-3 linguistic phenomena test experiment , 1991, MUC.

[19]  Greg Mori,et al.  Learning Structured Inference Neural Networks with Label Relations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Volker Tresp,et al.  The Tensor Brain: Semantic Decoding for Perception and Memory , 2020, ArXiv.

[21]  Wei Liu,et al.  Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[23]  Ashutosh Saxena,et al.  Hierarchical Semantic Labeling for Task-Relevant RGB-D Perception , 2014, Robotics: Science and Systems.

[24]  Liang Lin,et al.  Hybrid Knowledge Routed Modules for Large-scale Object Detection , 2018, NeurIPS.

[25]  Volker Tresp,et al.  An Unsupervised Joint System for Text Generation from Knowledge Graphs and Semantic Parsing , 2019, EMNLP.

[26]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[27]  Liang Lin,et al.  Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Abhinav Gupta,et al.  Zero-Shot Recognition via Semantic Embeddings and Knowledge Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Heike Adel,et al.  Noise Mitigation for Neural Entity Typing and Relation Extraction , 2016, EACL.

[30]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Volker Tresp,et al.  A Model for Perception and Memory , 2019, 2019 Conference on Cognitive Computational Neuroscience.

[32]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Volker Tresp,et al.  Classification by Attention: Scene Graph Classification with Prior Knowledge , 2020, AAAI.

[34]  Hang Li,et al.  Incorporating Copying Mechanism in Sequence-to-Sequence Learning , 2016, ACL.

[35]  Volker Tresp,et al.  Bringing Light Into the Dark: A Large-Scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Larry S. Davis,et al.  Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Evgeniy Gabrilovich,et al.  A Review of Relational Machine Learning for Knowledge Graphs , 2015, Proceedings of the IEEE.

[38]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[39]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[40]  Li Fei-Fei,et al.  Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval , 2015, VL@EMNLP.

[41]  Volker Tresp,et al.  Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions , 2017, SEMWEB.

[42]  Shih-Fu Chang,et al.  Learning Visual Commonsense for Robust Scene Graph Generation: Supplementary Material , 2020 .

[43]  Jason Weston,et al.  Translating Embeddings for Modeling Multi-relational Data , 2013, NIPS.

[44]  Iryna Gurevych,et al.  Investigating Pretrained Language Models for Graph-to-Text Generation , 2020, ArXiv.

[45]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res.

[46]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[47]  Ali Farhadi,et al.  From Recognition to Cognition: Visual Commonsense Reasoning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).