Image Captioning with Memorized Knowledge

Image captioning, which aims to automatically generate textual descriptions of given images, has received much attention from researchers. Most existing approaches adopt a recurrent neural network (RNN) as the decoder, generating captions conditioned on the input image information. However, traditional RNNs process the sequence recurrently, squeezing the information of all previous words into hidden cells and updating the context by fusing the hidden states with the current word information. This may lose rich knowledge from the distant past. In this paper, we propose a memory-enhanced captioning model for image captioning. We first introduce an external memory to store past knowledge, i.e., all the information about the generated words. When predicting the next word, the decoder can retrieve this past knowledge through a selective reading mechanism. Furthermore, to better exploit the knowledge stored in the memory, we introduce several variants that consider different types of past knowledge. To verify the effectiveness of the proposed model, we conduct extensive experiments and comparisons on the well-known MS COCO image captioning dataset. Compared with state-of-the-art captioning models, the proposed memory-enhanced captioning model yields a significant improvement in performance (3.5% in terms of CIDEr).
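To make the described decoding scheme concrete, below is a minimal PyTorch sketch of one decoding step with an external memory and a selective (attention-based) read, as the abstract describes. It is not the paper's released implementation; all names (MemoryEnhancedDecoderStep, selective_read), shapes, and the choice of storing decoder hidden states as memory slots are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryEnhancedDecoderStep(nn.Module):
    """One step of a memory-enhanced caption decoder (illustrative sketch).

    An external memory holds one slot per previously generated word; before
    predicting the next word, the decoder selectively reads from it via
    attention, then fuses the retrieved knowledge with the LSTM state.
    """

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm_cell = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, hidden_dim)  # maps state to a read query
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def selective_read(self, h, memory):
        # memory: (T, hidden_dim), one slot per past word (assumed non-empty,
        # e.g. initialized with a single zero slot at step 0).
        q = self.query(h).squeeze(0)                      # (hidden_dim,)
        scores = memory @ q / memory.size(1) ** 0.5       # (T,)
        weights = F.softmax(scores, dim=0)                # attention over past slots
        return (weights.unsqueeze(1) * memory).sum(0, keepdim=True)

    def forward(self, prev_word, h, c, memory):
        x = self.embed(prev_word)                         # (1, embed_dim)
        knowledge = self.selective_read(h, memory)        # (1, hidden_dim)
        h, c = self.lstm_cell(torch.cat([x, knowledge], dim=1), (h, c))
        logits = self.out(torch.cat([h, knowledge], dim=1))
        # Append the new state so later steps can re-read this word's information.
        memory = torch.cat([memory, h.detach()], dim=0)
        return logits, h, c, memory
```

In a greedy decoding loop, one would initialize `h`, `c`, and a one-slot memory from the image features, then repeatedly call the step, feed `logits.argmax(-1)` back in as `prev_word`, and let the memory grow by one slot per word; the paper's variants would differ in what each slot stores (word embeddings, hidden states, or both).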
