An Integrated Hybrid CNN–RNN Model for Visual Description and Generation of Captions

Video captioning is currently considered one of the simplest ways to index and search video data efficiently, and deep learning architectures now make suitable captioning of images and video frames feasible. Past research has focused on image captioning; however, generating high-quality captions with suitable semantics for different scenes has not yet been achieved. This work therefore aims to generate well-defined, meaningful captions for images and videos by combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Starting from the available dataset, image and video features were extracted using a CNN. The extracted feature vectors were then used to build a language model based on long short-term memory (LSTM) over individual word grams. The caption model was trained using a softmax function, and its performance was computed using predefined evaluation metrics. The experimental results demonstrate that the proposed model outperforms existing benchmark models.
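The pipeline described above (CNN feature vector, LSTM language model, softmax over words) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the weights are random, the vocabulary is a toy list, the "CNN feature" is a random stand-in vector assumed to be already projected to the word-embedding size, and all names (`lstm_step`, `greedy_caption`, `make_random_params`, `VOCAB`) are hypothetical. It only shows how a feature vector can seed an LSTM that then emits one word per step through a softmax.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only).
VOCAB = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps [x; h] to four stacked gate pre-activations."""
    n = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def make_random_params(vocab_size, embed_dim, hidden_dim, seed=0):
    """Random, untrained parameters; a trained model would learn these."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(0, 0.1, (4 * hidden_dim, embed_dim + hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
        "E": rng.normal(0, 0.1, (vocab_size, embed_dim)),   # word embeddings
        "V": rng.normal(0, 0.1, (vocab_size, hidden_dim)),  # output projection
    }


def greedy_caption(feature, params, vocab, max_len=5):
    """Seed the LSTM with the image feature, then pick the argmax word each step."""
    W, b, E, V = params["W"], params["b"], params["E"], params["V"]
    hidden_dim = V.shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    # The image feature is fed once, as the first input to the LSTM.
    h, c = lstm_step(feature, h, c, W, b)
    token = vocab.index("<start>")
    words = []
    for _ in range(max_len):
        h, c = lstm_step(E[token], h, c, W, b)
        probs = softmax(V @ h)          # distribution over the vocabulary
        token = int(np.argmax(probs))   # greedy choice
        if vocab[token] == "<end>":
            break
        words.append(vocab[token])
    return words


params = make_random_params(len(VOCAB), embed_dim=8, hidden_dim=16)
# Stand-in for a CNN feature vector already projected to embed_dim.
feature = np.random.default_rng(1).normal(size=8)
caption = greedy_caption(feature, params, VOCAB, max_len=5)
```

Because the parameters are untrained, the resulting `caption` is arbitrary; in a trained model the softmax probabilities would be fitted with a cross-entropy loss against reference captions, and the greedy choice could be replaced by beam search.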
