Neural Image Caption Generation with Weighted Training and Reference

Image captioning, which aims to automatically generate a sentence description for an image, has attracted much research attention in cognitive computing. The task is challenging because it requires combining techniques from both the computer vision and natural language processing domains. Existing methods based on the CNN-RNN framework suffer from two main problems: in the training phase, all words of a caption are treated equally, without considering the differing importance of each word; in the caption generation phase, semantic objects or scenes may be misrecognized. In this paper, we propose an encoder-decoder method named Reference-based Long Short-Term Memory (R-LSTM), which introduces reference information to guide the model toward generating a more descriptive sentence for the given image. Specifically, during training we assign each word a weight according to the correlation between that word and the image. During generation, we additionally maximize the consensus score between the captions produced by the model and the reference information drawn from the neighboring images of the target image, which reduces the misrecognition problem. Extensive experiments and comparisons on the benchmark datasets MS COCO and Flickr30k show that the proposed approach outperforms state-of-the-art approaches on all metrics, notably achieving a 10.37% improvement in CIDEr on MS COCO. An analysis of the quality of the generated captions indicates that, through the introduced reference information, our model learns the key information of an image and generates more detailed and relevant words.
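The two components described above lend themselves to a short sketch. Below is a minimal, illustrative Python/PyTorch sketch, not the authors' code: weighted_caption_loss weights each word's cross-entropy term by a per-word relevance score, and consensus_rerank re-scores candidate captions against captions retrieved from neighboring images. All function names, the lambda trade-off, and the choice of sentence-similarity metric are assumptions for illustration; the paper's exact weighting scheme and consensus definition may differ.

```python
# Illustrative sketch of weighted training and consensus re-ranking.
# Hypothetical names; the paper's actual formulation may differ.
import torch
import torch.nn.functional as F

def weighted_caption_loss(logits, targets, word_weights):
    """Cross-entropy over one caption where each ground-truth word
    carries its own weight (e.g., a word-image correlation score).

    logits:       (T, V) unnormalized decoder scores per time step
    targets:      (T,)   ground-truth word indices
    word_weights: (T,)   per-word weights; higher = more image-relevant
    """
    log_probs = F.log_softmax(logits, dim=-1)                    # (T, V)
    token_nll = -log_probs[torch.arange(targets.size(0)), targets]
    # Normalize so the loss scale is comparable across captions.
    return (word_weights * token_nll).sum() / word_weights.sum()

def consensus_rerank(candidates, cand_log_probs, neighbor_caps,
                     similarity, lam=1.0):
    """Re-rank candidate captions by adding a consensus score: the mean
    similarity of a candidate to the captions of the target image's
    nearest-neighbor images (the 'reference information').

    candidates:     list of generated captions
    cand_log_probs: model log-probability of each candidate
    neighbor_caps:  reference captions gathered from neighboring images
    similarity:     sentence-level scorer, e.g. a CIDEr/BLEU function
    lam:            trade-off between fluency and consensus (assumed)
    """
    def consensus(cap):
        return sum(similarity(cap, ref) for ref in neighbor_caps) \
               / len(neighbor_caps)

    scored = [(lp + lam * consensus(cap), cap)
              for cap, lp in zip(candidates, cand_log_probs)]
    return max(scored, key=lambda pair: pair[0])[1]
```

In such a setup, the similarity function would typically be a sentence-level CIDEr or BLEU scorer, mirroring the consensus idea used in nearest-neighbor captioning approaches; the re-ranking step is what allows reference information to correct misrecognized objects or scenes.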
