Revisiting Image-Language Networks for Open-ended Phrase Detection.

Most existing work that grounds natural language phrases in images starts from the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task, in which we must both determine whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases. This yields more than double the performance of a naive adaptation on three popular phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, whose test-time phrase vocabularies contain 5K, 32K, and 159K phrases, respectively.
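
The abstract only names the key ingredient, CCA-based initialization of the classification layers. Below is a minimal sketch (not the authors' released code) of how such an initialization could be computed from paired region features and phrase embeddings; the function and variable names (`cca_init`, `region_feats`, `phrase_embs`) and the feature dimensions in the example are illustrative assumptions.

```python
# Minimal sketch of CCA-based initialization for a two-branch
# image-text similarity head, assuming precomputed, paired region
# features and phrase embeddings.
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_init(region_feats, phrase_embs, dim=128):
    """Fit CCA between paired region features (N x D_img) and phrase
    embeddings (N x D_txt) and return the two projection matrices that
    can be used to initialize the joint-embedding / classification layers."""
    cca = CCA(n_components=dim, max_iter=1000)
    cca.fit(region_feats, phrase_embs)
    # Projection matrices mapping each modality into the shared CCA space.
    W_img = cca.x_rotations_   # shape (D_img, dim)
    W_txt = cca.y_rotations_   # shape (D_txt, dim)
    return W_img, W_txt

# Illustrative usage with random stand-ins for real features:
# rng = np.random.default_rng(0)
# W_img, W_txt = cca_init(rng.normal(size=(5000, 2048)),   # region features
#                         rng.normal(size=(5000, 300)),    # phrase embeddings
#                         dim=128)
# A region feature x and a phrase embedding p can then be scored by the
# cosine similarity of (x @ W_img) and (p @ W_txt) before fine-tuning.
```

The intuition is that starting the classification layers from CCA projections places semantically similar phrases close together in the joint space from the outset, rather than relying on random initialization to discover that structure during training.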
