Generating Affective Captions using Concept And Syntax Transition Networks

The area of image captioning, i.e., the automatic generation of short textual descriptions of images, has seen considerable progress recently. However, image captioning approaches typically focus only on describing the factual content of an image, without the emotional or sentimental dimension that is common in human-written captions. This paper presents an image captioning approach designed specifically to incorporate emotions and feelings into the caption generation process. The presented approach consists of a deep Convolutional Neural Network (CNN) for detecting Adjective-Noun Pairs (ANPs) in the image and a graphical network architecture, the "Concept And Syntax Transition (CAST)" network, for generating sentences from the detected concepts.
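To make the two-stage design concrete, the following is a minimal Python sketch of such a pipeline. It is an illustration under stated assumptions, not the authors' implementation: the `detect_anps` stub stands in for a real CNN-based ANP classifier (e.g., a DeepSentiBank-style model), and `CASTNetwork`, its toy vocabulary, and the concept-boosted weighted walk are hypothetical approximations of a CAST network as a weighted word-transition graph.

```python
# Minimal sketch of a two-stage affective captioning pipeline.
# All names (detect_anps, CASTNetwork, the toy graph) are hypothetical
# stand-ins for illustration, not the paper's actual implementation.
import random
from collections import defaultdict


def detect_anps(image_path):
    """Stand-in for a CNN-based Adjective-Noun Pair detector.
    A real system would return classifier scores; here they are fixed."""
    return [("happy dog", 0.92), ("sunny beach", 0.81)]


class CASTNetwork:
    """Toy transition network: a weighted directed graph whose nodes are
    words/concepts and whose edges encode allowed transitions. A caption
    is generated by a weighted walk from <start> to <end>."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(next_node, weight)]

    def add_transition(self, src, dst, weight=1.0):
        self.edges[src].append((dst, weight))

    def generate(self, concepts, max_len=10, seed=None):
        rng = random.Random(seed)
        # Boost transitions into words from detected ANPs so the caption
        # stays grounded in (and colored by) the image content.
        boosts = {word: score for anp, score in concepts
                  for word in anp.split()}
        node, caption = "<start>", []
        for _ in range(max_len):
            candidates = self.edges.get(node, [])
            if not candidates:
                break
            weights = [w * (1.0 + 5.0 * boosts.get(n, 0.0))
                       for n, w in candidates]
            node = rng.choices([n for n, _ in candidates],
                               weights=weights)[0]
            if node == "<end>":
                break
            caption.append(node)
        return " ".join(caption)


net = CASTNetwork()
for src, dst in [("<start>", "a"), ("a", "happy"), ("a", "sunny"),
                 ("happy", "dog"), ("sunny", "beach"),
                 ("dog", "on"), ("on", "a"), ("a", "beach"),
                 ("beach", "<end>"), ("dog", "<end>")]:
    net.add_transition(src, dst)

anps = detect_anps("example.jpg")
print(net.generate(anps, seed=0))  # e.g. "a happy dog on a sunny beach"
```

The key design point this sketch illustrates is the separation of concerns: the visual model only has to rank affective concepts (ANPs), while the transition network alone is responsible for turning those concepts into syntactically plausible sentences.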
