A topic-based multi-channel attention model under hybrid mode for image caption

Automatically generated image captions are not closely tied to every spatial region of the visual input, but they are always tied to the topic that the image expresses. To address the decoupling between visual spatial feature attention and the semantic decoder, a topic-based multi-channel attention model (TMA) under hybrid mode is proposed for image captioning. First, natural language processing (NLP) techniques are used to preprocess the reference captions, including filtering stop words, analyzing word frequency, and constructing a semantic network graph with node labels. Then, using the image features extracted by a convolutional neural network (CNN), a semantic perception network is designed to achieve cross-domain prediction from image to topic. Next, a topic-based multi-channel attention fusion mechanism is proposed, which produces a fused image-text attention representation from three inputs: the global spatial features of the image, the local semantic features of the graph nodes, and the hidden-layer features of the long short-term memory (LSTM) decoder. Finally, a multi-task loss function is used to train the TMA. Experimental results show that the proposed model, with its topic-focused attention, achieves better evaluation performance than state-of-the-art (SOTA) methods.
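The caption preprocessing step can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the `tokenize` and `build_topic_graph` helpers, the `top_k` cutoff, and the use of within-caption co-occurrence counts as edge weights are all assumptions made for illustration. The sketch filters stop words, counts word frequency, and builds a small labeled graph whose nodes are the most frequent words.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "with", "and", "at", "to"}


def tokenize(caption):
    """Lowercase a caption, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,!?;:\"'") for w in caption.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_topic_graph(captions, top_k=5):
    """Return (node frequencies, edge weights) over the top_k most frequent words.

    Assumption: an edge weight is the number of captions in which
    both node words appear together.
    """
    tokenized = [tokenize(c) for c in captions]
    freq = Counter(w for toks in tokenized for w in toks)
    nodes = {w for w, _ in freq.most_common(top_k)}

    edges = Counter()
    for toks in tokenized:
        # Sort so each undirected edge has one canonical (u, v) key.
        present = sorted(set(toks) & nodes)
        for u, v in combinations(present, 2):
            edges[(u, v)] += 1
    return {w: freq[w] for w in nodes}, dict(edges)


captions = [
    "A dog is running on the grass.",
    "The dog catches a frisbee on the grass.",
    "A brown dog plays with a frisbee.",
]
nodes, edges = build_topic_graph(captions, top_k=3)
print(nodes)
print(edges)
```

In this toy example the node labels are simply the most frequent content words. The node labels could then serve as candidate topics for the image-to-topic prediction step, although the abstract does not specify how the graph is actually constructed or consumed.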
