Multimodal Image Captioning Through Combining Reinforced Cross Entropy Loss and Stochastic Deprecation

Recently, Cross Entropy Loss (CEL) has proved useful in encoder-decoder multimodal image captioning; however, it suffers from a mismatch between the training objective and the evaluation metrics. In this paper, we propose a new approach to multimodal image captioning. It consists of 1) a Reinforced Cross Entropy Loss (RCEL), which maximizes the probability of ground-truth captions while directly optimizing the evaluation metrics, and 2) Stochastic Deprecation (SD), which automatically selects high-quality ground-truth sentences without sacrificing the diversity of the corpus. Both RCEL and SD are generic and can improve existing natural language generation models, and their combination (RCEL-SD) achieves the best results. Experiments on the MSCOCO benchmark show that RCEL-SD outperforms CEL on all seven evaluation metrics across three recent image captioning models.
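To make the two components concrete, the following is a minimal PyTorch-style sketch, not the paper's actual formulation. It assumes RCEL is a weighted mix of token-level cross entropy and a self-critical REINFORCE term (in the spirit of self-critical sequence training), and that SD keeps each reference caption with a probability tied to some quality score. The function names, the mixing weight `lam`, and the `keep_floor` parameter are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rcel_loss(token_log_probs, targets, sample_log_prob,
              sample_reward, greedy_reward, lam=0.5):
    """Sketch of an RCEL-style objective (assumed form): a weighted mix of
    cross entropy on the ground-truth caption and a self-critical REINFORCE
    term that rewards metric gains over a greedy-decoding baseline."""
    # Cross entropy over ground-truth tokens (maximizes their likelihood).
    ce = F.nll_loss(token_log_probs.reshape(-1, token_log_probs.size(-1)),
                    targets.reshape(-1))
    # REINFORCE with a self-critical baseline: the advantage is the metric
    # score (e.g. CIDEr) of a sampled caption minus that of the greedy one.
    advantage = (sample_reward - greedy_reward).detach()
    rl = -(advantage * sample_log_prob).mean()
    return lam * ce + (1.0 - lam) * rl

def stochastic_deprecation(captions, quality, keep_floor=0.2):
    """Sketch of SD (assumed form): keep each reference caption with a
    probability tied to a quality score in [0, 1], but never below
    keep_floor so low-scoring yet diverse sentences still survive."""
    keep_prob = keep_floor + (1.0 - keep_floor) * quality
    mask = torch.rand(len(captions)) < keep_prob
    return [c for c, keep in zip(captions, mask) if keep]
```

Under this reading, a natural training schedule would anneal `lam` from 1 toward 0 so optimization shifts from likelihood to metric reward; whether the paper uses such a schedule is not stated in the abstract.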
