Attribute-driven image captioning via soft-switch pointer

Abstract Visual attribute detection provides rich semantic concepts for image captioning. Some previous methods directly encode the detected attributes into vectors and generate captions from them, which ignores the correlations between image regions and attributes. In this paper, we bridge the gap between visual features and detected attributes: we first look at a specific region of the image and then decide which attribute to attend to. We propose an attribute-driven image captioning approach consisting of two parts: a visual positioning part and an attribute selection part. Specifically, we introduce the pointer-generator network into the second part of our model as a soft switch, which determines at each decoding step whether to generate a word from the hidden state or point to a detected attribute. Qualitative and quantitative experiments show that our model improves the coverage of key visual attributes and significantly boosts overall performance.
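To make the soft-switch idea concrete, below is a minimal PyTorch sketch of a pointer-generator decoding step in the spirit the abstract describes: a learned gate p_gen mixes a vocabulary distribution with a pointer distribution over the detected attributes. All module names, dimensions, and the attribute-attention layout are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch only: the real model's visual positioning and attribute selection
# parts are not reproduced here; this shows just the soft-switch mixing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSwitchDecoderStep(nn.Module):
    def __init__(self, vocab_size, hidden_dim, attr_dim):
        super().__init__()
        self.generator = nn.Linear(hidden_dim, vocab_size)    # vocabulary distribution
        self.attr_attn = nn.Linear(hidden_dim + attr_dim, 1)  # scores each detected attribute
        self.switch = nn.Linear(hidden_dim, 1)                # produces the gate p_gen

    def forward(self, h_t, attr_feats, attr_vocab_ids):
        # h_t:            (B, hidden_dim)  decoder hidden state at step t
        # attr_feats:     (B, K, attr_dim) embeddings of K detected attributes
        # attr_vocab_ids: (B, K) long      vocabulary indices of those attributes
        p_vocab = F.softmax(self.generator(h_t), dim=-1)      # (B, V) generate branch
        h_exp = h_t.unsqueeze(1).expand(-1, attr_feats.size(1), -1)
        scores = self.attr_attn(torch.cat([h_exp, attr_feats], dim=-1)).squeeze(-1)
        p_point = F.softmax(scores, dim=-1)                   # (B, K) pointer branch
        p_gen = torch.sigmoid(self.switch(h_t))               # (B, 1) soft switch
        # Mix the two branches: with prob. p_gen generate from the vocabulary,
        # with prob. 1 - p_gen point to (copy) one of the detected attributes.
        p_final = p_gen * p_vocab
        p_final = p_final.scatter_add(1, attr_vocab_ids, (1.0 - p_gen) * p_point)
        return p_final                                        # (B, V)
```

At inference, taking the argmax of p_final yields either an ordinary vocabulary word or one of the detected attribute words, so an attribute is copied verbatim into the caption whenever the gate favors pointing.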
