avtmNet: Adaptive Visual-Text Merging Network for Image Captioning

Abstract Automatically generating natural-language descriptions for images has recently attracted extensive research, and a variety of image captioning techniques have been proposed, among which the attention-based encoder-decoder framework has achieved great success. Two types of attention models are commonly used to generate captions: visual attention models, which excel at describing details, and text attention models, which excel at comprehensive understanding. To integrate and fully exploit both visual and textual information and generate more accurate captions, in this paper we first introduce a visual attention model to produce visual information and a text attention model to produce text information, and then propose an adaptive visual-text merging network (avtmNet). The merging network effectively fuses the visual and textual information and automatically determines the proportion of each used to generate the next caption word. Extensive experiments on the COCO2014 and Flickr30K datasets demonstrate the effectiveness and superiority of the proposed approach.
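The abstract does not give the exact equations of the merging network, so the following is only a minimal sketch of one plausible form of the adaptive merge: a learned scalar gate that mixes a visual context vector and a text context vector before word prediction. The class name AdaptiveVisualTextMerge, the concatenation-based gate, and all dimensions are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn as nn


class AdaptiveVisualTextMerge(nn.Module):
    """Hypothetical sketch of an adaptive visual-text merging gate.

    Given a visual context v_t (from a visual attention model) and a text
    context s_t (from a text attention model), a scalar gate beta_t in [0, 1]
    decides the proportion of each that feeds the word decoder.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Assumed gating form: condition on the decoder hidden state
        # and both context vectors.
        self.gate = nn.Linear(3 * hidden_dim, 1)

    def forward(self, h_t, v_t, s_t):
        # beta_t = sigmoid(W [h_t; v_t; s_t])  -- illustrative choice only.
        beta_t = torch.sigmoid(self.gate(torch.cat([h_t, v_t, s_t], dim=-1)))
        # Merged context: beta_t of the visual side, (1 - beta_t) of the text side.
        c_t = beta_t * v_t + (1.0 - beta_t) * s_t
        return c_t, beta_t


if __name__ == "__main__":
    merge = AdaptiveVisualTextMerge(hidden_dim=512)
    h_t = torch.randn(8, 512)   # decoder hidden state
    v_t = torch.randn(8, 512)   # visual attention context
    s_t = torch.randn(8, 512)   # text attention context
    c_t, beta_t = merge(h_t, v_t, s_t)
    print(c_t.shape, beta_t.shape)  # torch.Size([8, 512]) torch.Size([8, 1])
```

Under this reading, beta_t close to 1 would favor visual details for the next word, while beta_t close to 0 would lean on the textual summary; the actual paper may implement the proportioning differently.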
