Stimulus-driven and concept-driven analysis for image caption generation

Abstract: Image captioning has recently made great progress in computer vision and artificial intelligence, yet language models still fall short of the desired results on high-level visual tasks. Generating accurate captions for a complex scene that contains multiple targets remains a challenge. To address these problems, we introduce the psychological theory of attention into image caption generation and propose two types of attention mechanisms: stimulus-driven and concept-driven. Our attention model combines a convolutional neural network (CNN) over images with a long short-term memory (LSTM) network over sentences. Experimental comparisons show that the proposed method achieves good performance on the MSCOCO test server.
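To make the CNN-plus-LSTM attention pipeline concrete, the following is a minimal sketch, assuming a standard soft-attention decoder over spatial CNN features (in the spirit of "Show, Attend and Tell"); class and parameter names such as SoftAttention, CaptionDecoder, and feat_dim are illustrative assumptions, not the paper's implementation, and the stimulus-driven versus concept-driven distinction is only hinted at by the two inputs to the attention scorer.

```python
# Hypothetical sketch: soft attention over CNN feature maps feeding an LSTM
# caption decoder. Not the authors' code; names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Scores each spatial CNN feature against the decoder hidden state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # image (stimulus) pathway
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # language (concept) pathway
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) spatial features; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = F.softmax(e, dim=1)            # attention weights over image regions
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) attended context vector
        return context, alpha.squeeze(-1)

class CaptionDecoder(nn.Module):
    """LSTM decoder that consumes the attended image context at every step."""
    def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = SoftAttention(feat_dim, hidden_dim, attn_dim=256)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, L, feat_dim); captions: (B, T) token ids (teacher forcing)
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            context, _ = self.attn(feats, h)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)      # (B, T, vocab_size)

# Toy usage with random tensors standing in for VGG/ResNet feature maps.
if __name__ == "__main__":
    decoder = CaptionDecoder(vocab_size=1000)
    feats = torch.randn(2, 49, 512)            # e.g. a 7x7 conv map, flattened
    captions = torch.randint(0, 1000, (2, 12))
    print(decoder(feats, captions).shape)      # torch.Size([2, 12, 1000])
```

In this sketch the attention weights are recomputed at every decoding step from the current hidden state, so the decoder can shift its focus across image regions as the sentence unfolds.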
