Text-Guided Attention Model for Image Captioning

Visual attention plays an important role to understand images and demonstrates its effectiveness in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplar-based learning approach that retrieves from training data associated captions with each image, and use them to learn attention on visual features. Our attention model enables to describe a detailed state of scenes by distinguishing small or confusable objects effectively. We validate our model on MS-COCO Captioning benchmark and achieve the state-of-the-art performance in standard metrics.

[1]  Bohyung Han,et al.  Training Recurrent Answering Units with Joint Loss Minimization for VQA , 2016, ArXiv.

[2]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Koray Kavukcuoglu,et al.  Visual Attention , 2020, Computational Models for Cognitive Vision.

[4]  Ivan Laptev,et al.  ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization , 2016, ECCV.

[5]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[6]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[7]  Seunghoon Hong,et al.  Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[9]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[13]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[15]  Saurabh Gupta,et al.  Exploring Nearest Neighbor Approaches for Image Captioning , 2015, ArXiv.

[16]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[17]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[18]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Gary Lupyan,et al.  Making the Invisible Visible: Verbal but Not Visual Cues Enhance Visual Detection , 2010, PloS one.

[20]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[21]  Ramesh Kumar Mishra Interaction Between Attention and Language Systems in Humans , 2015 .

[22]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[23]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Xu Wei,et al.  Learning Like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[27]  Chunhua Shen,et al.  What Value Do Explicit High Level Concepts Have in Vision to Language Problems? , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.