Refining Attention: A Sequential Attention Model for Image Captioning

Visual attention is widely applied to image captioning. Previous works feed visual attention and linguistic words into a long short-term memory (LSTM) network together, but neglect the sequential relation among the attention computed at different time steps of word prediction. Moreover, the abstraction level of visual attention usually differs from that of linguistic words. To address these issues, this work proposes a sequential attention model that handles visual attention by explicitly modeling this sequential relation, so that the internal relation among the attention at successive word-prediction steps is exploited to enhance the visual information during sentence decoding. Experimental results on the benchmark MSCOCO and Flickr30K datasets show that the proposed model performs strongly, achieving CIDEr and BLEU-4 scores of 108.1 and 34.9, respectively, on MSCOCO.
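To make the idea concrete, below is a minimal PyTorch sketch of one way to realize what the abstract describes: the attention context computed at each decoding step is passed through an auxiliary LSTM, so the context at step t is refined by the attention history instead of being computed independently at every step. All names, dimensions, and the two-LSTM layout here are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of a sequential-attention decoding step (PyTorch).
# The module names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequentialAttentionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Standard additive attention over regional CNN features.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Auxiliary LSTM over the *sequence* of attention contexts, so the
        # context at step t is conditioned on the attention history.
        self.att_rnn = nn.LSTMCell(feat_dim, hidden_dim)
        # Language LSTM consumes the refined attention state plus the word embedding.
        self.lang_rnn = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.logits = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, att_state, lang_state):
        # feats: (B, R, feat_dim) regional CNN features, R = number of regions.
        h_lang, _ = lang_state
        # Additive attention scores conditioned on the language hidden state.
        e = self.att_score(
            torch.tanh(self.att_feat(feats) + self.att_hid(h_lang).unsqueeze(1))
        ).squeeze(-1)                                     # (B, R)
        alpha = F.softmax(e, dim=-1)
        context = (alpha.unsqueeze(-1) * feats).sum(1)    # (B, feat_dim)
        # Refine the raw context with the attention LSTM (the sequential relation).
        att_state = self.att_rnn(context, att_state)
        refined = att_state[0]                            # (B, hidden_dim)
        # Predict the next word from the refined context and the current word.
        x = torch.cat([self.embed(word_ids), refined], dim=-1)
        lang_state = self.lang_rnn(x, lang_state)
        return self.logits(lang_state[0]), att_state, lang_state
```

In use, `step` would be called once per decoding time step inside a loop, carrying `att_state` and `lang_state` forward; the carried `att_state` is what lets attention at one step inform attention at the next.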
