Bidirectional LSTM approach to image captioning with scene features

Image captioning involves generating a sentence that describes an image. In recent years, it has been driven by encoder-decoder approaches, in which an encoder, such as a convolutional neural network (CNN), extracts the visual features of an image. The extracted visual features are passed to a decoder, such as a long short-term memory (LSTM) network, to generate a sentence that describes the image. One major challenge with this approach is to accurately include the scene of an image in the generated sentences. To address this challenge, visual scene features have been used with unidirectional LSTM decoders. However, unidirectional decoding limits the precision of the generated text for longer sentences. This research proposes a novel approach to generating sentences from visual scene information with a bidirectional LSTM decoder. The encoder is based on Inception v3 to extract the object features and Places365 to extract the scene features. The decoder uses a bidirectional LSTM to generate a sentence. The encoder-decoder model is trained on the Flickr8k dataset. Results show improved performance for generating longer sentences, with a 9% increase in BLEU-3 and a 12% increase in BLEU-4 scores compared to other encoder-decoder methods that are limited to using only global image features. Visually impaired people who use screen readers would benefit from this research, as they would receive an enhanced description of an image that includes the background scene, creating a more complete picture in the mind of the reader.
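To make the described architecture concrete, the following is a minimal sketch in Keras/TensorFlow of an encoder-decoder captioning model that merges pre-extracted Inception v3 object features with Places365 scene features and decodes with a bidirectional LSTM. All layer sizes, feature dimensions, and the scene-feature width are illustrative assumptions, not values reported in the paper.

```python
# Sketch of the encoder-decoder with scene features and a bidirectional
# LSTM decoder. Dimensions and hyperparameters are assumed for illustration.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000   # assumed vocabulary size for Flickr8k captions
MAX_LEN = 34        # assumed maximum caption length
OBJECT_DIM = 2048   # Inception v3 pooled feature size
SCENE_DIM = 512     # assumed Places365 scene-feature size
EMBED_DIM = 256
LSTM_UNITS = 256

# Encoder inputs: pre-extracted object features (Inception v3) and
# scene features (Places365 CNN), concatenated into one visual vector.
object_features = layers.Input(shape=(OBJECT_DIM,), name="object_features")
scene_features = layers.Input(shape=(SCENE_DIM,), name="scene_features")
visual = layers.Concatenate()([object_features, scene_features])
visual = layers.Dropout(0.5)(visual)
visual = layers.Dense(2 * LSTM_UNITS, activation="relu")(visual)

# Decoder input: the partial caption generated so far, as word indices.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_in")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
embedded = layers.Dropout(0.5)(embedded)

# Bidirectional LSTM over the partial caption (forward + backward states
# are concatenated, giving a 2 * LSTM_UNITS language representation).
language = layers.Bidirectional(layers.LSTM(LSTM_UNITS))(embedded)

# Merge the visual and language representations and predict the next word.
merged = layers.Add()([visual, language])
hidden = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[object_features, scene_features, caption_in],
              outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

At inference time, such a model would be run word by word: the partial caption is fed back in together with the fixed object and scene features until an end-of-sentence token is produced. The additive merge of visual and language vectors is one common design choice; the paper itself may combine them differently.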
