MR-NET: Exploiting Mutual Relation for Visual Relationship Detection

Inferring the interactions between objects, a.k.a. visual relationship detection, is crucial for visual understanding, as it captures more definite concepts than object detection. Most previous work treats the interaction between a pair of objects as one-way and therefore fails to exploit the mutual relation between objects, which is essential to modern visual applications. In this work, we propose a mutual relation net, dubbed MR-Net, to explore the mutual relation between paired objects for visual relationship detection. Specifically, we construct a mutual relation space to model the mutual interaction of paired objects, and employ a linear constraint to optimize the mutual interaction, which we call mutual relation learning. Our mutual relation learning does not introduce any parameters, and can be adapted to improve the performance of other methods. In addition, we devise a semantic ranking loss to discriminatively penalize predicates with semantic similarity, which is ignored by traditional loss functions (e.g., cross entropy with softmax). Our MR-Net then optimizes the mutual relation learning together with the semantic ranking loss using a siamese network. The experimental results on two commonly used datasets (VG and VRD) demonstrate the superior performance of the proposed approach.
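
To make the idea of a semantic ranking loss more concrete, the following is a minimal sketch and not the formulation used in MR-Net, which the abstract does not specify. It assumes a hinge-style ranking loss whose margin shrinks as a competing predicate becomes semantically closer to the ground-truth predicate, so near-synonyms are penalized less than unrelated predicates. The function names, the cosine-based margin, and the toy two-dimensional predicate embeddings are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_ranking_loss(scores, target, embeddings):
    """Hinge ranking loss with a similarity-dependent margin (assumed form).

    scores:     predicted score for each predicate class
    target:     index of the ground-truth predicate
    embeddings: one word vector per predicate (hypothetical inputs)
    """
    loss = 0.0
    for k, score in enumerate(scores):
        if k == target:
            continue
        # Assumption: margin = 1 - cosine similarity, so predicates that are
        # semantically close to the target need a smaller score gap.
        margin = 1.0 - cosine(embeddings[target], embeddings[k])
        loss += max(0.0, margin + score - scores[target])
    return loss


# Toy predicate embeddings, invented for this example only.
toy_embeddings = [
    [1.0, 0.0],  # "on" (ground truth)
    [0.8, 0.6],  # "above", semantically close to "on"
    [0.0, 1.0],  # "eat", unrelated to "on"
]
toy_scores = [2.0, 1.9, 1.9]

print(semantic_ranking_loss(toy_scores, 0, toy_embeddings))
```

In this toy case both wrong predicates receive the same score, yet the unrelated one contributes most of the loss because its margin is larger. A plain softmax cross entropy would treat the two wrong predicates identically, which is the behavior the abstract argues against.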
