Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning visual relationships of objects in images, which is however rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection system. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel modules, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets. The source code is available at https://github.com/coldmanck/RVL-BERT.

[1]  Xiaogang Wang,et al.  Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation , 2018, ECCV.

[2]  Ruiqin Xiong,et al.  Visual Relationship Embedding Network for Image Paragraph Generation , 2020, IEEE Transactions on Multimedia.

[3]  Xu Zhao,et al.  Context-Associative Hierarchical Memory Model for Human Activity Recognition and Prediction , 2017, IEEE Transactions on Multimedia.

[4]  Jonathan Berant,et al.  Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction , 2018, NeurIPS.

[5]  Tao Mei,et al.  Exploring Visual Relationship for Image Captioning , 2018, ECCV.

[6]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jiebo Luo,et al.  Understanding Kin Relationships in a Photo , 2012, IEEE Transactions on Multimedia.

[8]  Anton van den Hengel,et al.  Graph-Structured Representations for Visual Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[10]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[11]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[12]  Mohan S. Kankanhalli,et al.  Interact as You Intend: Intention-Driven Human-Object Interaction Detection , 2018, IEEE Transactions on Multimedia.

[13]  Radu Soricut,et al.  Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.

[14]  Shih-Fu Chang,et al.  Bridging Knowledge Graphs to Generate Scene Graphs , 2020, ECCV.

[15]  Ivan Laptev,et al.  Weakly-Supervised Learning of Visual Relations , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Juan-Zi Li,et al.  Explainable and Explicit Visual Reasoning Over Scene Graphs , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Xiaogang Wang,et al.  ViP-CNN: Visual Phrase Guided Convolutional Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Ian D. Reid,et al.  Towards Context-Aware Interaction Recognition for Visual Relationship Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[21]  Ji Zhang,et al.  Graphical Contrastive Losses for Scene Graph Parsing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Vikas Singh,et al.  Tensorize, Factorize and Regularize: Robust Visual Relationship Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Jia Deng,et al.  Pixels to Graphs by Associative Embedding , 2017, NIPS.

[24]  Kevin Gimpel,et al.  Gaussian Error Linear Units (GELUs) , 2016 .

[25]  Cho-Jui Hsieh,et al.  VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.

[26]  Shuqiang Jiang,et al.  Know More Say Less: Image Captioning Based on Scene Graphs , 2019, IEEE Transactions on Multimedia.

[27]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[28]  George Kurian,et al.  Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[29]  Weijian Li,et al.  Attentive Relational Networks for Mapping Images to Scene Graphs , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Ji Zhang,et al.  Large-Scale Visual Relationship Understanding , 2018, AAAI.

[32]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[33]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[34]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[35]  Nenghai Yu,et al.  Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition , 2018, ECCV.

[36]  Olga Russakovsky,et al.  SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37]  Larry S. Davis,et al.  Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Furu Wei,et al.  VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.

[39]  Richard Socher,et al.  Ask Me Anything: Dynamic Memory Networks for Natural Language Processing , 2015, ICML.

[40]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[41]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Li Fei-Fei,et al.  Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval , 2015, VL@EMNLP.

[43]  Jongyoul Park,et al.  Visual Relationship Detection with Language prior and Softmax , 2018, 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS).

[44]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[45]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Svetlana Lazebnik,et al.  Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Jianfei Cai,et al.  Scene Graph Generation With External Knowledge and Image Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Dan Zeng,et al.  Exploring Depth Information for Spatial Relation Recognition , 2020, 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR).

[51]  Zhenzhong Chen,et al.  Hierarchical Graph Attention Network for Visual Relationship Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Wei Liu,et al.  Learning to Compose Dynamic Tree Structures for Visual Contexts , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Li Fei-Fei,et al.  Image Generation from Scene Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Jun Yu,et al.  On Exploring Undetermined Relationships for Visual Relationship Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Shih-Fu Chang,et al.  PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[57]  In-So Kweon,et al.  LinkNet: Relational Embedding for Scene Graph , 2018, NeurIPS.

[58]  Xiaogang Wang,et al.  Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[59]  Ali Farhadi,et al.  Recognition using visual phrases , 2011, CVPR 2011.

[60]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Jianfei Cai,et al.  Auto-Encoding Scene Graphs for Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[63]  Liang Lin,et al.  Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Ning Xu,et al.  Scene graph captioner: Image captioning based on structural visual representation , 2019, J. Vis. Commun. Image Represent..