Caption generation is one of the fundamental tasks combining computer vision and natural language processing, and neural networks are the standard approach to it. In this paper, we propose a caption generation system that combines a CNN-based object detection system with a recurrent neural network language model. In particular, the vector passed from the object detection system to the language model is produced by an attention mechanism, and visualizing the attention weights shows which part of the input image the system focuses on while generating a caption. In the experiments, we evaluate the performance of the proposed system and discuss the effect of the attention mechanism on image captioning. We find that attention contributes to improved caption generation, but that the attention weights are uncorrelated with interpretation of the system's behavior.
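To make the described architecture concrete, the following is a minimal sketch of the attention step, assuming a PyTorch implementation with additive (Bahdanau-style) soft attention over a flattened CNN feature grid; the class and parameter names here (SoftAttention, feat_dim, attn_dim, and so on) are illustrative assumptions, not the paper's actual code.

# Minimal sketch of soft attention over CNN features for captioning.
# Assumptions (not from the paper): PyTorch, additive attention, and an
# LSTM-style decoder state; all names are illustrative.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Scores each spatial CNN feature against the decoder's hidden
    state and returns a weighted context vector plus the weights."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor):
        # feats:  (batch, num_regions, feat_dim) -- flattened CNN grid
        # hidden: (batch, hidden_dim)            -- decoder state
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                       # (batch, num_regions)
        alpha = torch.softmax(e, dim=1)      # attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (batch, feat_dim)
        return context, alpha

# Toy usage: a 14x14 feature grid (196 regions) of 512-dim vectors.
attn = SoftAttention(feat_dim=512, hidden_dim=256, attn_dim=128)
feats = torch.randn(2, 196, 512)
hidden = torch.randn(2, 256)
context, alpha = attn(feats, hidden)
print(context.shape, alpha.shape)  # (2, 512) and (2, 196)

The context vector is what such a system would feed to the language model at each decoding step, and the alpha weights are what attention visualization would reshape back to the spatial grid (e.g., 14x14) and overlay on the input image.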