Learning Relationship-Aware Visual Features

Relational reasoning in Computer Vision has recently shown impressive results on visual question answering tasks. On the challenging dataset called CLEVR, the recently proposed Relation Network (RN), a simple plug-and-play module and one of the state-of-the-art approaches, has obtained a very good accuracy (95.5%) answering relational questions. In this paper, we define a sub-field of Content-Based Image Retrieval (CBIR) called Relational-CBIR (R-CBIR), in which we are interested in retrieving images with given relationships among objects. To this aim, we employ the RN architecture in order to extract relation-aware features from CLEVR images. To prove the effectiveness of these features, we extended both CLEVR and Sort-of-CLEVR datasets generating a ground-truth for R-CBIR by exploiting relational data embedded into scene-graphs. Furthermore, we propose a modification of the RN module – a two-stage Relation Network (2S-RN) – that enabled us to extract relation-aware features by using a preprocessing stage able to focus on the image content, leaving the question apart. Experiments show that our RN features, especially the 2S-RN ones, outperform the RMAC state-of-the-art features on this new challenging task.

[1]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[2]  Ronan Sicre,et al.  Particular object retrieval with integral max-pooling of CNN activations , 2015, ICLR.

[3]  Matthew B. Blaschko,et al.  Joint Embeddings of Scene Graphs and Images , 2017, ICLR.

[4]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Kevin Chen-Chuan Chang,et al.  A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications , 2017, IEEE Transactions on Knowledge and Data Engineering.

[6]  Aaron C. Courville,et al.  Learning Visual Reasoning Without Strong Priors , 2017, ICML 2017.

[7]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[8]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Trevor Darrell,et al.  Segmentation from Natural Language Expressions , 2016, ECCV.

[10]  Yuandong Tian,et al.  Simple Baseline for Visual Question Answering , 2015, ArXiv.

[11]  Albert Gordo,et al.  End-to-End Learning of Deep Visual Representations for Image Retrieval , 2016, International Journal of Computer Vision.

[12]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Daniel C. Krawczyk,et al.  A hierarchy for relational reasoning in the prefrontal cortex , 2011, Cortex.

[14]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[15]  Jean-Yves Ramel,et al.  An Exact Graph Edit Distance Algorithm for Solving Pattern Recognition Problems , 2015, ICPRAM.

[16]  Kaspar Riesen,et al.  Approximate graph edit distance computation by means of bipartite graph matching , 2009, Image Vis. Comput..

[17]  Ivan Laptev,et al.  Weakly-Supervised Learning of Visual Relations , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[20]  Li Fei-Fei,et al.  Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Massimo Melucci,et al.  On rank correlation in information retrieval evaluation , 2007, SIGF.

[22]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[23]  Razvan Pascanu,et al.  Discovering objects and their relations from entangled scene representations , 2017, ICLR.

[24]  Ji Zhang,et al.  Large-Scale Visual Relationship Understanding , 2018, AAAI.

[25]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).