Paper Information - Learning Relationship-Aware Visual Features

Learning Relationship-Aware Visual Features

Relational reasoning in computer vision has recently shown impressive results on visual question answering tasks. On the challenging CLEVR dataset, the recently proposed Relation Network (RN), a simple plug-and-play module and one of the state-of-the-art approaches, achieves 95.5% accuracy on relational questions. In this paper, we define a sub-field of Content-Based Image Retrieval (CBIR) called Relational-CBIR (R-CBIR), whose goal is to retrieve images that contain given relationships among objects. To this end, we employ the RN architecture to extract relation-aware features from CLEVR images. To assess the effectiveness of these features, we extend both the CLEVR and Sort-of-CLEVR datasets, generating a ground truth for R-CBIR by exploiting the relational data embedded in their scene graphs. Furthermore, we propose a modification of the RN module, a two-stage Relation Network (2S-RN), which extracts relation-aware features through a preprocessing stage that focuses on the image content and leaves the question aside. Experiments show that our RN features, especially the 2S-RN ones, outperform the state-of-the-art RMAC features on this new, challenging task.
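The idea of using pairwise relation features as an image descriptor for retrieval can be illustrated with a small sketch. It is not the paper's implementation: it assumes a Relation Network-style aggregation in which a pairwise function g is summed over all ordered object pairs, and it leaves out the question conditioning and the final network that the full RN applies. The weights are random rather than trained, the "objects" are toy vectors standing in for CNN features, and the names `relation_features`, `cosine`, and `rank_images` are hypothetical.

```python
import math
import random


def relation_features(objects, weights):
    """Sum a one-layer ReLU pairwise function g over all ordered object pairs."""
    feature = [0.0] * len(weights)
    for o_i in objects:
        for o_j in objects:
            pair = o_i + o_j  # concatenate the two object vectors
            for k, row in enumerate(weights):
                activation = sum(w * x for w, x in zip(row, pair))
                feature[k] += max(activation, 0.0)  # ReLU
    return feature


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(query_feature, database_features):
    """Return database indices sorted by decreasing cosine similarity."""
    scores = [cosine(query_feature, f) for f in database_features]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Assumption: random, untrained weights and toy data, for illustration only.
rng = random.Random(0)
obj_dim, out_dim = 4, 8
weights = [[rng.gauss(0, 1) for _ in range(2 * obj_dim)] for _ in range(out_dim)]
# Five toy "images", each a list of three object vectors.
images = [[[rng.gauss(0, 1) for _ in range(obj_dim)] for _ in range(3)]
          for _ in range(5)]
features = [relation_features(img, weights) for img in images]
ranking = rank_images(features[0], features)
print(ranking)
```

Because the pairwise outputs are summed, the descriptor does not depend on the order in which the objects are listed, so the same scene yields the same feature however its objects are enumerated.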
