Classification by Attention: Scene Graph Classification with Prior Knowledge

A main challenge in scene graph classification is that the appearance of objects and relations can be significantly different from one image to another. Previous works have addressed this by relational reasoning over all objects in an image, or incorporating prior knowledge into classification. Unlike previous works, we do not consider separate models for the perception and prior knowledge. Instead, we take a multi-task learning approach, where the classification is implemented as an attention layer. This allows for the prior knowledge to emerge and propagate within the perception model. By enforcing the model to also represent the prior, we achieve a strong inductive bias. We show that our model can accurately generate commonsense knowledge and that the iterative injection of this knowledge to scene representations leads to a significantly higher classification performance. Additionally, our model can be fine-tuned on external knowledge given as triples. When combined with self-supervised learning, this leads to accurate predictions with 1% of annotated images only.

[1]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[2]  Immanuel Kant Kritik der reinen Vernunft: [Hauptband] , 1900 .

[3]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[4]  Ashutosh Saxena,et al.  Hierarchical Semantic Labeling for Task-Relevant RGB-D Perception , 2014, Robotics: Science and Systems.

[5]  Razvan Pascanu,et al.  On the Number of Linear Regions of Deep Neural Networks , 2014, NIPS.

[6]  Volker Tresp,et al.  The Tensor Brain: Semantic Decoding for Perception and Memory , 2020, ArXiv.

[7]  Gary Marcus,et al.  Deep Learning: A Critical Appraisal , 2018, ArXiv.

[8]  Volker Tresp,et al.  Improving Information Extraction from Images with Learned Semantic Models , 2018, IJCAI.

[9]  Michael S. Bernstein,et al.  Scene Graph Prediction with Limited Labels , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[10]  Wei Liu,et al.  Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Razvan Pascanu,et al.  Relational inductive biases, deep learning, and graph networks , 2018, ArXiv.

[12]  Dileep George,et al.  Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics , 2017, ICML.

[13]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[14]  Volker Tresp,et al.  Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions , 2017, SEMWEB.

[15]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[16]  Sir Frederic Bartlett Schema Theory , 2017 .

[17]  Mirella Lapata,et al.  Text Generation from Knowledge Graphs with Graph Transformers , 2019, NAACL.

[18]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[19]  Yoshua Bengio,et al.  The Consciousness Prior , 2017, ArXiv.

[20]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[21]  Hans-Peter Kriegel,et al.  A Three-Way Model for Collective Learning on Multi-Relational Data , 2011, ICML.

[22]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[23]  Liang Lin,et al.  Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[25]  Larry S. Davis,et al.  Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Jason Weston,et al.  Translating Embeddings for Modeling Multi-relational Data , 2013, NIPS.

[27]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[28]  Lukasz Kaiser,et al.  Rethinking Attention with Performers , 2020, ArXiv.

[29]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[30]  Long Chen,et al.  Counterfactual Critic Multi-Agent Training for Scene Graph Generation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[32]  Artur Buchenau Kritik der reinen Vernunft , 1928 .

[33]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[34]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Volker Tresp,et al.  A Model for Perception and Memory , 2019, 2019 Conference on Cognitive Computational Neuroscience.

[36]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[38]  Jiebo Luo,et al.  Relational Reasoning using Prior Knowledge for Visual Captioning , 2019, ArXiv.

[39]  Evgeniy Gabrilovich,et al.  A Review of Relational Machine Learning for Knowledge Graphs , 2015, Proceedings of the IEEE.

[40]  Michael S. Bernstein,et al.  Scene Graph Prediction with Limited Labels , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[41]  Michal Valko,et al.  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[42]  Volker Tresp,et al.  Improving Visual Relation Detection using Depth Maps , 2019, 2020 25th International Conference on Pattern Recognition (ICPR).

[43]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.

[44]  Greg Mori,et al.  LabelBank: Revisiting Global Perspectives for Semantic Segmentation , 2017, ArXiv.

[45]  Yoshua Bengio,et al.  Object Files and Schemata: Factorizing Declarative and Procedural Knowledge in Dynamical Systems , 2020, ArXiv.

[46]  Greg Mori,et al.  Learning Structured Inference Neural Networks with Label Relations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[48]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[49]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Zenon W. Pylyshyn,et al.  Connectionism and cognitive architecture: A critical analysis , 1988, Cognition.

[52]  Murray Shanahan,et al.  Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules , 2020, ICML.

[53]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Bernhard Schölkopf,et al.  Recurrent Independent Mechanisms , 2021, ICLR.

[55]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[56]  Hefeng Wu,et al.  Learning Semantic-Specific Graph Representation for Multi-Label Image Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).