Referring Expression Comprehension with Semantic Visual Relationship and Word Mapping

Referring expression comprehension, which locates the object instance described by a natural language expression, gains increasing interests in recent years. This paper aims at improving the task from two aspects: visual feature extraction and language features extraction. For visual feature extraction, we observe that most of the previous methods utilize only relative spatial information to model the visual relationship between object pairs while discarding rich semantic relationship between objects. This makes the visual-language matching difficult when the language expression contains semantic relationship to discriminate the referred object from other objects in the image. In this work, we propose a Semantic Visual Relationship Module (SVRM) to exploit this important information. For language feature extraction, a major problem comes from the long-tail distribution of words in the expressions. Since more than half of the words appear less than 20 times in the public datasets, deep models such as LSTM tend to fail to learn accurate representations for these words. To solve this problem, we propose a word2vec based word mapping method that maps these low frequency words to high frequency words with similar meaning. Experiments show that the proposed method outperforms existing state-of-the-art methods on three referring expression comprehension datasets.

[1]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Larry S. Davis,et al.  Modeling Context Between Objects for Referring Expression Understanding , 2016, ECCV.

[3]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4]  Trevor Darrell,et al.  Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[5]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[6]  Licheng Yu,et al.  A Joint Speaker-Listener-Reinforcer Model for Referring Expressions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Serge J. Belongie,et al.  Object categorization using co-occurrence, location and appearance , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Qi Wu,et al.  Image Captioning and Visual Question Answering Based on Attributes and External Knowledge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[10]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Samy Bengio,et al.  Learning semantic relationships for better action retrieval in images , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Licheng Yu,et al.  Modeling Context in Referring Expressions , 2016, ECCV.

[13]  Trevor Darrell,et al.  Modeling Relationships in Referential Expressions with Compositional Modular Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Li Fei-Fei,et al.  Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Gregory Shakhnarovich,et al.  Comprehension-Guided Referring Expressions , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[21]  Xiaogang Wang,et al.  Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation , 2018, ECCV.

[22]  Xiaogang Wang,et al.  Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[24]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[27]  Dan Klein,et al.  Learning to Compose Neural Networks for Question Answering , 2016, NAACL.

[28]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[29]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[30]  Liang Wang,et al.  Referring Expression Generation and Comprehension via Attributes , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Licheng Yu,et al.  MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Tao Mei,et al.  Boosting Image Captioning with Attributes , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).