Hierarchical Visual Relationship Detection

Acting as a bridge between vision and language, visual relationship detection (VRD) aims to represent the objects in an image and their interactions as a set of relationship triplets. However, the conventional VRD task pays little attention to how incorrect relationship predictions are penalized, which limits its usefulness for image understanding applications. In this paper, we propose a novel VRD task named hierarchical visual relationship detection (HVRD), which encourages predicting abstract yet compatible relationship triplets when confidence in the specific image content is relatively low. HVRD can also handle the inevitable ambiguity of ground-truth annotation in VRD. On this basis, we propose an HVRD method consisting of hierarchical object detection and hierarchical predicate detection, which detects hierarchical visual relationships by exploiting both the object concept hierarchy and the predicate concept hierarchy with order embedding. We also introduce the first datasets for HVRD evaluation, H-VRD and H-VG, built by expanding the relationship category spaces of the VRD and VG datasets, respectively, into hierarchical ones. The experimental results show that our method outperforms the state-of-the-art baselines.
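
To make the order-embedding idea concrete, the following is a minimal sketch of the generic order-violation penalty, not the method proposed in this paper. It assumes the common convention that more general concepts lie closer to the origin, so a specific concept is compatible with a general one when each of its coordinates is at least as large. The function name `order_violation`, the `EMBEDDINGS` table, and the toy vectors are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of an order-violation penalty; not the paper's implementation.
# Assumed convention: general concepts sit closer to the origin, so
# "specific is-a general" holds when specific[i] >= general[i] for every i.

def order_violation(specific, general):
    """Squared norm of max(0, general - specific); zero means the order holds."""
    return sum(max(0.0, g - s) ** 2 for s, g in zip(specific, general))

# Toy non-negative embeddings, invented for illustration only.
EMBEDDINGS = {
    "animal": [1.0, 0.0],
    "dog": [2.0, 1.0],
    "vehicle": [0.0, 3.0],
}

def is_compatible(specific, general, tol=1e-9):
    """True when `specific` can be abstracted to `general` under the toy embeddings."""
    return order_violation(EMBEDDINGS[specific], EMBEDDINGS[general]) <= tol

if __name__ == "__main__":
    print(is_compatible("dog", "animal"))   # dog can be abstracted to animal
    print(is_compatible("dog", "vehicle"))  # dog cannot be abstracted to vehicle
```

In this sketch, a zero penalty marks a concept pair as hierarchically compatible, which is the kind of test a hierarchical detector could use when backing off from a specific label to a more abstract one.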
