Image captioning with deep LSTM based on sequential residual

Image captioning is a fundamental task which requires semantic understanding of images and the ability of generating description sentences with proper and correct structure. In consideration of the problem that language models are always shallow in modern image caption frameworks, a deep residual recurrent neural network is proposed in this work with the following two contributions. First, an easy-to-train deep stacked Long Short Term Memory (LSTM) language model is designed to learn the residual function of output distributions by adding identity mappings to multi-layer LSTMs. Second, in order to overcome the over-fitting problem caused by larger-scale parameters in deeper LSTM networks, a novel temporal Dropout method is proposed into LSTM. The experimental results on the benchmark MSCOCO and Flickr30K datasets demonstrate that the proposed model achieves the state-of-the-art performances with 101.1 in CIDEr on MSCOCO and 22.9 in B-4 on Flickr30K, respectively.

[1]  Yejin Choi,et al.  TreeTalk: Composition and Compression of Trees for Image Descriptions , 2014, TACL.

[2]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[3]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[4]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[5]  Yann LeCun,et al.  Regularization of Neural Networks using DropConnect , 2013, ICML.

[6]  Abd El Rahman Shabayek,et al.  Training Very Deep Networks via Residual Learning with Stochastic Input Shortcut Connections , 2017, ICONIP.

[7]  Chunhua Shen,et al.  What Value Do Explicit High Level Concepts Have in Vision to Language Problems? , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  David A. Forsyth,et al.  Swapout: Learning an ensemble of deep architectures , 2016, NIPS.

[10]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Xu Jia,et al.  Guiding the Long-Short Term Memory Model for Image Caption Generation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[14]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[15]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[17]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Cyrus Rashtchian,et al.  Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.

[19]  Ruslan Salakhutdinov,et al.  Multimodal Neural Language Models , 2014, ICML.

[20]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[21]  Qi Tian,et al.  DisturbLabel: Regularizing CNN on the Loss Layer , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[23]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[24]  Jürgen Schmidhuber,et al.  Highway Networks , 2015, ArXiv.