Deep Structured Learning for Visual Relationship Detection

In the research area of computer vision and artificial intelligence, learning the relationships of objects is an important way to deeply understand images. Most of recent works detect visual relationship by learning objects and predicates respectively in feature level, but the dependencies between objects and predicates have not been fully considered. In this paper, we introduce deep structured learning for visual relationship detection. Specifically, we propose a deep structured model, which learns relationship by using feature-level prediction and label-level prediction to improve learning ability of only using feature-level predication. The feature-level prediction learns relationship by discriminative features, and the label-level prediction learns relationships by capturing dependencies between objects and predicates based on the learnt relationship of feature level. Additionally, we use structured SVM (SSVM) loss function as our optimization goal, and decompose this goal into the subject, predicate, and object optimizations which become more simple and more independent. Our experiments on the Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome (VG) dataset validate the effectiveness of our method, which outperforms state-of-the-art methods.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[3]  Andrew McCallum,et al.  Structured Prediction Energy Networks , 2015, ICML.

[4]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Stephen Gould,et al.  Multi-Class Segmentation with Relative Location Prior , 2008, International Journal of Computer Vision.

[6]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[7]  Xiangyang Li,et al.  Visual relationship detection with object spatial distribution , 2017, 2017 IEEE International Conference on Multimedia and Expo (ICME).

[8]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[9]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[10]  Eric P. Xing,et al.  Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Alan L. Yuille,et al.  Learning Deep Structured Models , 2014, ICML.

[12]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ali Farhadi,et al.  Incorporating Scene Context and Object Layout into Appearance Modeling , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Luc Van Gool,et al.  Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[17]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[20]  Xiaoxiao Li,et al.  Semantic Image Segmentation via Deep Parsing Network , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Kaiming He,et al.  Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Fei-Fei Li,et al.  Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[26]  Serge J. Belongie,et al.  Object categorization using co-occurrence, location and appearance , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Ali Farhadi,et al.  Recognition using visual phrases , 2011, CVPR 2011.

[28]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Fu Jie Huang,et al.  A Tutorial on Energy-Based Learning , 2006 .