Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

We present a novel context-aware, attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes the visual features of an image as input and learns complex spatial patterns from two-dimensional context by selecting or ignoring its input. The Grid LSTM has not previously been applied to the image caption generation task. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. The region-grounded texts often describe the properties of the objects in an image and their relationships. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer utilizes a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word by word while considering the caption context around each word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state of the art.
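As a rough illustration of the pipeline described above, the sketch below approximates the two stages in PyTorch: bidirectional row and column LSTM passes standing in for the Bidirectional Grid LSTM encoder, and a single soft-attention decoding step standing in for the dynamic spatial attention (which, in the actual model, is itself Grid-LSTM-based and additionally consumes the region-grounded texts). All names and sizes here (`GridContextEncoder`, `SpatialAttentionDecoder`, a 7x7 grid of 512-d CNN features) are our own assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, NOT the paper's code: row/column bidirectional LSTMs
# approximate a 2-D Grid LSTM encoder; one soft-attention LSTM step stands
# in for the dynamic spatial attention decoder. All names are hypothetical.
import torch
import torch.nn as nn


class GridContextEncoder(nn.Module):
    """Spatially contextualize a grid of CNN features (Grid-LSTM stand-in)."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.row_lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.col_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                bidirectional=True)

    def forward(self, grid):                # grid: (B, H, W, feat_dim)
        B, H, W, D = grid.shape
        rows, _ = self.row_lstm(grid.reshape(B * H, W, D))    # left<->right
        rows = rows.reshape(B, H, W, -1).transpose(1, 2)      # (B, W, H, 2h)
        cols, _ = self.col_lstm(rows.reshape(B * W, H, -1))   # top<->bottom
        return cols.reshape(B, W, H, -1).transpose(1, 2)      # (B, H, W, 2h)


class SpatialAttentionDecoder(nn.Module):
    """One word-generation step with soft attention over the region grid."""
    def __init__(self, ctx_dim=512, hidden=256, vocab=10000):
        super().__init__()
        self.score = nn.Linear(ctx_dim + hidden, 1)
        self.cell = nn.LSTMCell(ctx_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, ctx, h, c):           # ctx: (B, N, ctx_dim) regions
        B, N, _ = ctx.shape
        expanded = h.unsqueeze(1).expand(B, N, h.size(-1))
        alpha = torch.softmax(
            self.score(torch.cat([ctx, expanded], -1)).squeeze(-1), dim=1)
        attended = (alpha.unsqueeze(-1) * ctx).sum(1)         # weighted ctx
        h, c = self.cell(attended, (h, c))
        return self.out(h), h, c            # logits for the next word


encoder = GridContextEncoder()
decoder = SpatialAttentionDecoder()
grid = torch.randn(2, 7, 7, 512)            # fake CNN feature grid
ctx = encoder(grid).reshape(2, 49, -1)      # contextualized region features
h = c = torch.zeros(2, 256)
logits, h, c = decoder(ctx, h, c)           # one decoding step
```

One design point the sketch makes concrete: the decoder attends over features that have already been contextualized across the whole grid, so each attention weight reflects a region's spatial surroundings rather than the raw local feature alone, which is the distinction the abstract draws against plain soft attention.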
