Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a downstream task to guide the process of phrase localization. Our method, as a first step, infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption and creates a discriminative image representation using these matched RoIs. In the subsequent step, this learned representation is aligned with the caption. Our key contribution lies in building this "caption-conditioned" image encoding, which tightly couples both the tasks and allows the weak supervision to effectively guide visual grounding. We provide extensive empirical and qualitative analysis to investigate the different components of our proposed model and compare it with competitive baselines. For phrase localization, we report an improvement of 4.9% and 1.3% (absolute) over the prior state-of-the-art on the VisualGenome and Flickr30k Entities datasets. We also report results that are at par with the state-of-the-art on the downstream caption-to-image retrieval task on COCO and Flickr30k datasets.

[1]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[2]  Trevor Darrell,et al.  Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[3]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[5]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Cordelia Schmid,et al.  Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  José M. F. Moura,et al.  Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog , 2017, EMNLP.

[8]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Alexander J. Smola,et al.  Deep Sets , 2017, 1703.06114.

[11]  Ajay Divakaran,et al.  Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention , 2018, ArXiv.

[12]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[13]  Shih-Fu Chang,et al.  Multi-Level Multimodal Common Semantic Space for Image-Phrase Grounding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  T. Tuytelaars,et al.  Weakly Supervised Object Detection with Posterior Regularization , 2014 .

[15]  Ramakant Nevatia,et al.  Knowledge Aided Consistency for Weakly Supervised Phrase Grounding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  C. Qi Deep Learning on Point Sets for 3 D Classification and Segmentation , 2016 .

[17]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[18]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[19]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Jieping Ye,et al.  Adaptive Distance Metric Learning for Clustering , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Liwei Wang,et al.  Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Xi Chen,et al.  Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[23]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[24]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Yong Jae Lee,et al.  Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[28]  Rama Chellappa,et al.  Zero-Shot Object Detection , 2018, ECCV.

[29]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[31]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[32]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Wei Wang,et al.  Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[35]  Martin Engilberge,et al.  Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Louis-Philippe Morency,et al.  Using Syntax to Ground Referring Expressions in Natural Images , 2018, AAAI.

[38]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[39]  Yoshua Bengio,et al.  Gated Feedback Recurrent Neural Networks , 2015, ICML.

[40]  David J. Fleet,et al.  VSE++: Improved Visual-Semantic Embeddings , 2017, ArXiv.

[41]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[43]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[44]  Aviv Eisenschtat,et al.  Linking Image and Text with 2-Way Nets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Nuno Vasconcelos,et al.  Multiple instance learning for soft bags via top instances , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[47]  Gang Hua,et al.  Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).