Image captioning using DenseNet network and adaptive attention

Abstract In image captioning, it is difficult to extract the global features of an image correctly. Moreover, most attention methods force every word to correspond to an image region, ignoring the fact that words such as "the" in the description text have no corresponding region. To address these problems, this paper proposes an adaptive attention model with a visual sentinel. In the encoding phase, the model uses DenseNet to extract the global features of the image. At each time step, a sentinel gate set by the adaptive attention mechanism decides whether to use the image feature information to generate the current word. In the decoding phase, a long short-term memory (LSTM) network is applied as the language generation model to improve the quality of the generated captions. Experiments on the Flickr30k and COCO datasets indicate that the proposed model achieves significant improvements on the BLEU and METEOR evaluation metrics.
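
The following is a minimal PyTorch-style sketch of the sentinel-gated attention step described above. It is not the authors' implementation: the module name, layer names, and sizes (k region features of dimension d, attention dimension a) are illustrative assumptions, following the standard visual-sentinel formulation of adaptive attention.

```python
import torch
import torch.nn as nn


class VisualSentinelAttention(nn.Module):
    """Sketch of adaptive attention with a visual sentinel.

    Assumed (hypothetical) sizes: k image regions with feature size d,
    LSTM hidden/cell size d, attention size a.
    """

    def __init__(self, d: int, a: int):
        super().__init__()
        self.w_x = nn.Linear(d, d)    # sentinel gate: contribution of the LSTM input
        self.w_h = nn.Linear(d, d)    # sentinel gate: contribution of the previous hidden state
        self.att_v = nn.Linear(d, a)  # project region features
        self.att_s = nn.Linear(d, a)  # project the sentinel
        self.att_h = nn.Linear(d, a)  # project the current hidden state
        self.score = nn.Linear(a, 1)  # scalar attention score

    def forward(self, V, x_t, h_prev, h_t, m_t):
        # V: (B, k, d) region features; x_t: (B, d) LSTM input at step t;
        # h_prev, h_t: (B, d) previous/current hidden states; m_t: (B, d) memory cell.

        # Sentinel gate and visual sentinel: a fallback "non-visual" vector
        # distilled from the language model's memory.
        g_t = torch.sigmoid(self.w_x(x_t) + self.w_h(h_prev))   # (B, d)
        s_t = g_t * torch.tanh(m_t)                              # (B, d)

        # Attention scores over the k regions plus the sentinel (k + 1 candidates).
        z_v = self.score(torch.tanh(self.att_v(V) + self.att_h(h_t).unsqueeze(1)))  # (B, k, 1)
        z_s = self.score(torch.tanh(self.att_s(s_t) + self.att_h(h_t)))              # (B, 1)
        alpha = torch.softmax(torch.cat([z_v.squeeze(-1), z_s], dim=1), dim=1)       # (B, k + 1)

        # beta_t is the weight placed on the sentinel: a value near 1 means the word
        # is generated mainly from the language model rather than the image.
        beta_t = alpha[:, -1:]                                         # (B, 1)
        c_t = (alpha[:, :-1].unsqueeze(-1) * V).sum(dim=1)             # (B, d) visual context
        c_hat = beta_t * s_t + (1.0 - beta_t) * c_t                    # adaptive context
        return c_hat, beta_t
```

In a decoder built this way, the adaptive context c_hat would be combined with the LSTM hidden state to predict the next word, so function words like "the" can be produced with a high sentinel weight instead of being forced onto an image region.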
