Integration of textual cues for fine-grained image captioning using deep CNN and LSTM

The automatic narration of a natural scene is an important capability in artificial intelligence that unites computer vision and natural language processing, and caption generation remains a challenging task in scene understanding. Most state-of-the-art methods use deep convolutional neural networks to extract visual features of the entire image, and then exploit the parallel structure between images and sentences with recurrent neural networks to generate captions. However, such models rely only on visual features for caption generation. This work investigates how fusing the text that appears in an image can yield more fine-grained captioning of a scene. We propose a model that combines a deep convolutional neural network with long short-term memory to improve the accuracy of image captioning by fusing text features available in an image with the visual features extracted by state-of-the-art methods. We validate the effectiveness of the proposed model on the benchmark datasets Flickr8k and Flickr30k. The experimental results show that the proposed model outperforms state-of-the-art methods for image captioning.
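The fusion idea can be illustrated with a minimal sketch. The abstract does not specify how the text and visual features are combined, so everything below is an assumption for illustration only: the text found in the image is represented by the mean of its word embeddings, fusion is plain concatenation, and the function names, toy vectors, and dimensions are hypothetical.

```python
from typing import Dict, List


def mean_word_vector(words: List[str],
                     embeddings: Dict[str, List[float]],
                     dim: int) -> List[float]:
    """Average the embeddings of the words present in the lookup table.

    Returns a zero vector when none of the words is known, so an image
    without readable text still yields a text feature of the right size.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fuse_features(visual: List[float], text: List[float]) -> List[float]:
    """Fuse by concatenation (an assumed strategy, not the paper's stated one)."""
    return list(visual) + list(text)


# Toy stand-ins: a real system would take `visual` from a CNN layer and
# `scene_words` from a scene-text detector and recognizer.
embeddings = {"stop": [1.0, 0.0], "cafe": [0.0, 1.0]}
visual = [0.2, 0.5, 0.1]
scene_words = ["stop", "cafe", "unknown"]

text = mean_word_vector(scene_words, embeddings, dim=2)
fused = fuse_features(visual, text)
print(len(fused), fused)
```

Under these assumptions, the fused vector would take the place of the purely visual feature that conditions the LSTM decoder, so words read from the scene can influence the generated caption.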
