Hybrid Image Captioning Model

Image captioning combines deep learning and Natural Language Processing (NLP) to produce a textual description of an image. The proposed model generates a caption for an image using a Convolutional Neural Network (CNN) together with a Recurrent Neural Network (RNN) and areas of attention. Previously, image names were used as keys to map images to their descriptions. To achieve high performance, the proposed model bases the caption on the relationship between the areas of a picture (the attention model), the words used in the caption, and the state of an RNN language model. The image dataset is loaded progressively. VGG16, a pre-trained CNN, encodes each image into a feature vector. These image encodings are given as input to a Long Short-Term Memory (LSTM) network, a specific type of RNN. The LSTM then decodes the feature vector and predicts a sequence of words, which forms the description or caption. The training performance is measured using BLEU, a quantitative evaluation metric.
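
As an illustration of the progressive loading step, the following minimal sketch yields the training samples of one image at a time rather than holding the expanded dataset in memory. It is not the authors' implementation: the function names, the `startseq`/`endseq` sentinel tokens, and the left-padding scheme are all assumptions made for the example, and the feature vectors are assumed to have been extracted beforehand (for instance by VGG16).

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical sentinel tokens marking the start and end of a caption.
START, END = "startseq", "endseq"


def build_vocab(captions: Dict[str, List[str]]) -> Dict[str, int]:
    """Map every word to an integer id; id 0 is reserved for padding."""
    words = {START, END}
    for caps in captions.values():
        for cap in caps:
            words.update(cap.lower().split())
    return {w: i for i, w in enumerate(sorted(words), start=1)}


def expand_caption(
    caption: str, vocab: Dict[str, int], max_len: int
) -> List[Tuple[List[int], int]]:
    """Turn one caption into (left-padded prefix, next-word id) pairs."""
    ids = [vocab[w] for w in [START] + caption.lower().split() + [END]]
    pairs = []
    for i in range(1, len(ids)):
        prefix = ids[:i][-max_len:]  # keep at most max_len tokens
        padded = [0] * (max_len - len(prefix)) + prefix
        pairs.append((padded, ids[i]))
    return pairs


def progressive_loader(
    captions: Dict[str, List[str]],
    features: Dict[str, List[float]],
    vocab: Dict[str, int],
    max_len: int,
) -> Iterator[Tuple[List[List[float]], List[List[int]], List[int]]]:
    """Yield (image features, word prefixes, next words) for one image at a time.

    One full pass over the generator corresponds to one epoch.
    """
    for image_id, caps in captions.items():
        feats, prefixes, targets = [], [], []
        for cap in caps:
            for prefix, target in expand_caption(cap, vocab, max_len):
                feats.append(features[image_id])
                prefixes.append(prefix)
                targets.append(target)
        yield feats, prefixes, targets
```

Each yielded triple pairs the image's feature vector with a growing caption prefix and the word that should follow it, which is the input and target layout an LSTM decoder of this kind would typically be trained on. Because only one image's samples exist at a time, memory use stays bounded by the longest caption set rather than by the dataset size.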
