Visual enhanced gLSTM for image captioning

Abstract To reduce the negative impact of vanishing gradients on the guiding long short-term memory (gLSTM) model in image captioning, we propose a visual enhanced gLSTM model for image caption generation. Visual features of an image's region of interest (RoI) are extracted and used as guiding information in the gLSTM, so that visual information about the RoI is injected into the gLSTM to generate more accurate image captions. Two visual enhancement methods are proposed, one based on the salient region and one based on the entire image: CNN features of the most important semantic region and visual-word features of the full image are extracted to guide the LSTM toward generating the most important semantic words. The visual features and the text features of similar images are then projected into a common semantic space by canonical correlation analysis to obtain the visual enhancement guiding information, which is added to each memory cell of the gLSTM when generating caption words. Compared with the original gLSTM, the visual enhanced gLSTM focuses on the important semantic region, which is more consistent with human perception of images. Experiments on the Flickr8k dataset show that the proposed method produces more accurate image captions and outperforms the baseline gLSTM as well as other popular image captioning methods.
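
To make the guidance mechanism concrete, the equations below give a minimal sketch of a guided LSTM memory cell in which a guidance vector g, here assumed to be the CCA-projected visual enhancement feature described above, enters every gate alongside the current word embedding x_t and the previous hidden state m_{t-1}. The weight names and the gating form follow the general gLSTM formulation of Jia et al. rather than this paper's exact notation.

% Sketch of a guided LSTM cell: the time-invariant guidance vector g
% (assumed here to be the CCA-projected RoI / visual-word feature)
% is added to every gate together with the word embedding x_t and
% the previous hidden state m_{t-1}. Weight symbols are illustrative.
\begin{align}
i_t &= \sigma\!\left(W_{ix} x_t + W_{im} m_{t-1} + W_{ig}\, g\right) \\
f_t &= \sigma\!\left(W_{fx} x_t + W_{fm} m_{t-1} + W_{fg}\, g\right) \\
o_t &= \sigma\!\left(W_{ox} x_t + W_{om} m_{t-1} + W_{og}\, g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_{cx} x_t + W_{cm} m_{t-1} + W_{cg}\, g\right) \\
m_t &= o_t \odot c_t
\end{align}

Because g is held fixed across time steps, the guiding information influences every generated word, which is presumably why injecting RoI-based visual features into each memory cell can steer the caption toward the salient region of the image.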
