Phrase Localization and Visual Relationship Detection with Comprehensive Linguistic Cues

This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues.1 We model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions. We pay special attention to relationships between people and clothing or body part mentions, as they are useful for distinguishing individuals. We automatically learn weights for combining these cues and at test time, perform joint inference over all phrases in a caption. The resulting system produces a 5% improvement in accuracy over the state of the art on phrase localization on the Flickr30k Entities dataset [25] and a 4-10% improvement for visual relationship detection on the Stanford VRD dataset [20].

[1]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[2]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[3]  Licheng Yu,et al.  Visual Madlibs: Fill in the blank Image Generation and Question Answering , 2015, ArXiv.

[4]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[6]  Trevor Darrell,et al.  Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[7]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[9]  Vicente Ordonez,et al.  ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.

[10]  Lior Wolf,et al.  Associating neural word embeddings with deep image representations using Fisher Vectors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[12]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[13]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[14]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[15]  Svetlana Lazebnik,et al.  Scene Parsing with Object Instances and Occlusion Ordering , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.

[17]  Sanja Fidler,et al.  A Sentence Is Worth a Thousand Pixels , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Michael Isard,et al.  A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics , 2012, International Journal of Computer Vision.

[19]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[20]  Ruslan Mitkov,et al.  Robust Pronoun Resolution with Limited Knowledge , 1998, ACL.

[21]  Ali Farhadi,et al.  VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[23]  Sanda M. Harabagiu,et al.  Knowledge-Lean Coreference Resolution and its Relation to Textual Cohesion and Coherence , 1999, Workshop On The Relation Of Discourse/Dialogue Structure And Reference.

[24]  Jeffrey C. Lagarias,et al.  Convergence Properties of the Nelder-Mead Simplex Method in Low Dimensions , 1998, SIAM J. Optim..

[25]  Li Fei-Fei,et al.  DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[27]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[28]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Andrew Y. Ng,et al.  Parsing with Compositional Vector Grammars , 2013, ACL.

[32]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[35]  Rada Mihalcea,et al.  Structured Matching for Phrase Localization , 2016, ECCV.

[36]  Ali Farhadi,et al.  Recognition using visual phrases , 2011, CVPR 2011.