论文信息 - CPARR: Category-based Proposal Analysis for Referring Relationships

CPARR: Category-based Proposal Analysis for Referring Relationships

The task of referring relationships is to localize subject and object entities in an image satisfying a relationship query, which is given in the form of . This requires simultaneous localization of the subject and object entities in a specified relationship. We introduce a simple yet effective proposal-based method for referring relationships. Different from the existing methods such as SSAS, our method can generate a high-resolution result while reducing its complexity and ambiguity. Our method is composed of two modules: a category-based proposal generation module to select the proposals related to the entities and a predicate analysis module to score the compatibility of pairs of selected proposals. We show state-of-the-art performance on the referring relationship task on two public datasets: Visual Relationship Detection and Visual Genome.

Kan Chen | Haidong Zhu | Ram Nevatia | Chuanzi He | Jiyang Gao

[1] Rada Mihalcea,et al. Structured Matching for Phrase Localization , 2016, ECCV.

[2] Song-Chun Zhu,et al. Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.

[3] Ramakant Nevatia,et al. Query-Guided Regression Network with Context Policy for Phrase Grounding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4] Xilin Chen,et al. Visual Relationship Detection With Deep Structural Ranking , 2018, AAAI.

[5] Svetlana Lazebnik,et al. Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[6] Stefan Lee,et al. Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[7] Xiaogang Wang,et al. Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation , 2018, ECCV.

[8] Zheng-Jun Zha,et al. Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding , 2019, ACM Multimedia.

[9] Xuming He,et al. Pose-Aware Multi-Level Feature Network for Human Object Interaction Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10] Xiaogang Wang,et al. Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[12] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[13] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[14] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Nenghai Yu,et al. Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition , 2018, ECCV.

[16] Larry S. Davis,et al. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17] Mohan S. Kankanhalli,et al. Learning to Detect Human-Object Interactions With Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Trevor Darrell,et al. Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[19] Ramakant Nevatia,et al. Knowledge Aided Consistency for Weakly Supervised Phrase Grounding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20] Xinlei Chen,et al. An Implementation of Faster RCNN with Study for Region Sampling , 2017, ArXiv.

[21] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Trevor Darrell,et al. Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Danfei Xu,et al. Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Ramakant Nevatia,et al. MSRC: multimodal spatial regression with semantic context for phrase grounding , 2017, International Journal of Multimedia Information Retrieval.

[25] Licheng Yu,et al. MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26] Liang Lin,et al. Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Serge J. Belongie,et al. Object categorization using co-occurrence, location and appearance , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[28] Xiaogang Wang,et al. ViP-CNN: Visual Phrase Guided Convolutional Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[30] Chen Gao,et al. iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection , 2018, BMVC.

[31] Yejin Choi,et al. Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[33] Cordelia Schmid,et al. Detecting Unseen Visual Relations Using Analogies , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34] Jianfei Cai,et al. Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features , 2018, ECCV.

[35] Michael S. Bernstein,et al. Referring Relationships , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36] Bo Dai,et al. Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Kaiming He,et al. Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38] Trevor Darrell,et al. Modeling Relationships in Referential Expressions with Compositional Modular Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Kan Chen,et al. Zero-Shot Grounding of Objects From Natural Language Queries , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40] D. LaBerge,et al. Shifting attention in visual space: tests of moving-spotlight models versus an activity-distribution model. , 1997, Journal of experimental psychology. Human perception and performance.

[41] Michael S. Bernstein,et al. Visual Relationship Detection with Language Priors , 2016, ECCV.

[42] Jia Deng,et al. Learning to Detect Human-Object Interactions , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[43] Cewu Lu,et al. Transferable Interactiveness Knowledge for Human-Object Interaction Detection , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).