Engaging Image Captioning via Personality

Standard image captioning tasks such as COCO and Flickr30k are factual, neutral in tone, and (to a human) state the obvious (e.g., “a man playing a guitar”). While such tasks are useful for verifying that a machine understands the content of an image, they are not engaging to humans as captions. With this in mind, we define a new task, PERSONALITY-CAPTIONS, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits. We collect and release a large dataset of 241,858 such captions conditioned on 215 possible traits. We build models that combine existing work on (i) sentence representations [29], using Transformers trained on 1.7 billion dialogue examples; and (ii) image representations [56], using ResNets trained on 3.5 billion social media images. We obtain state-of-the-art performance on Flickr30k and COCO, and strong performance on our new task. Finally, online evaluations validate that our task and models are engaging to humans, with our best model close to human performance.
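
The abstract describes retrieval-style models that pair a pretrained image encoder with a Transformer caption encoder and condition on a personality trait. The sketch below only illustrates that general shape of architecture; it is not the authors' released code, and the module names, dimensions, and the choice of a small ResNet-34 backbone are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of a retrieval-style
# personality captioner: a ResNet backbone encodes the image, a small
# Transformer encodes candidate captions, a learned embedding injects the
# personality trait, and captions are ranked by dot-product similarity.
import torch
import torch.nn as nn
import torchvision.models as models


class PersonalityCaptionRanker(nn.Module):
    def __init__(self, vocab_size, n_traits, dim=512):
        super().__init__()
        # Image encoder: ResNet backbone with its classifier head removed
        # (a large pretrained backbone would be used in practice).
        resnet = models.resnet34()
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.image_proj = nn.Linear(resnet.fc.in_features, dim)
        # Personality trait embedding: one vector per trait.
        self.trait_emb = nn.Embedding(n_traits, dim)
        # Caption encoder: token embeddings + a small Transformer encoder.
        self.tok_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def encode_context(self, images, traits):
        # Fuse image features and the trait vector into one context vector.
        feats = self.image_encoder(images).flatten(1)            # (B, C)
        return self.image_proj(feats) + self.trait_emb(traits)   # (B, D)

    def encode_captions(self, tokens):
        # Mean-pool Transformer outputs into one vector per caption.
        return self.text_encoder(self.tok_emb(tokens)).mean(dim=1)  # (B, D)

    def forward(self, images, traits, tokens):
        # Score every (context, caption) pair in the batch; the diagonal
        # holds the matching pairs, so cross-entropy over rows trains retrieval.
        ctx = self.encode_context(images, traits)
        cap = self.encode_captions(tokens)
        return ctx @ cap.t()                                      # (B, B)


# Toy usage with random data.
model = PersonalityCaptionRanker(vocab_size=1000, n_traits=215)
scores = model(torch.randn(4, 3, 224, 224),
               torch.randint(0, 215, (4,)),
               torch.randint(0, 1000, (4, 12)))
loss = nn.functional.cross_entropy(scores, torch.arange(4))
```

Trained with a batch-wise ranking loss like the cross-entropy above, such a model scores candidate captions against the fused image-and-trait context and returns the highest-scoring caption at test time.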

[1]  Tsuyoshi Murata, et al., ACML.

[2]  Lexing Xie,et al.  SentiCap: Generating Image Descriptions with Sentiments , 2015, AAAI.

[3]  Sergio Escalera,et al.  First Impressions: A Survey on Computer Vision-Based Apparent Personality Trait Analysis , 2018, ArXiv.

[4]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jiebo Luo,et al.  Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions , 2018, ArXiv.

[6]  Liwei Wang,et al.  Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[8]  Pascale Fung,et al.  Adapting a Virtual Agent to User Personality , 2017, IWSDS.

[9]  Timothy Jay,et al.  Filling the emotion gap in linguistic theory: Commentary on Potts' expressive dimension , 2007 .

[10]  Martin Engilberge,et al.  Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jiebo Luo,et al.  VizWiz Grand Challenge: Answering Visual Questions from Blind People , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Sébastien Marcel,et al.  Torchvision the machine-vision package of torch , 2010, ACM Multimedia.

[16]  Devi Parikh,et al.  Punny Captions: Witty Wordplay in Image Descriptions , 2017, NAACL.

[17]  Jianwei Yang,et al.  Neural Baby Talk , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Wei Wang,et al.  Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[20]  José M. F. Moura,et al.  Visual Dialog , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Xinlei Chen,et al.  Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.

[22]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[23]  Christos Faloutsos,et al.  Automatic image captioning , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[24]  Matthias Scheutz,et al.  The utility of affect expression in natural language interactions in joint human-robot tasks , 2006, HRI '06.

[25]  Vicki L. Campbell,et al.  The Sixteen Personality Factor Questionnaire (16PF) , 2012 .

[26]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[27]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Antoine Bordes,et al.  Training Millions of Personalized Dialogue Agents , 2018, EMNLP.

[30]  Kota Yoshida,et al.  Neural Joking Machine: Humorous image captioning , 2018, ArXiv.

[31]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[32]  David J. Fleet,et al.  VSE++: Improved Visual-Semantic Embeddings , 2017, ArXiv.

[33]  Lexing Xie,et al.  SemStyle: Learning to Generate Stylised Image Captions Using Unaligned Text , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Pratik Rane,et al.  Self-Critical Sequence Training for Image Captioning , 2018 .

[35]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[36]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Rafał Jończyk Affect-Language Interactions in Native and Non-Native English Speakers , 2016 .

[38]  A. Abele,et al.  Agency and communion from the perspective of self versus others. , 2007, Journal of personality and social psychology.

[39]  Gang Hua,et al.  Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and VQA , 2017, ArXiv.

[42]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[43]  Gunhee Kim,et al.  Attend to You: Personalized Image Captioning with Context Sequence Memory Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[45]  Emily Denton,et al.  User Conditional Hashtag Prediction for Images , 2015, KDD.

[46]  Jason Weston,et al.  Personalizing Dialogue Agents: I have a dog, do you have pets too? , 2018, ACL.

[47]  Zhe Gan,et al.  StyleNet: Generating Attractive Visual Captions with Styles , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Gang Wang,et al.  Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[50]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[51]  David A. Shamma,et al.  The New Data and New Challenges in Multimedia Research , 2015, ArXiv.

[52]  Joelle Pineau,et al.  How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation , 2016, EMNLP.

[53]  Subbarao Kambhampati,et al.  What We Instagram: A First Analysis of Instagram Photo Content and User Types , 2014, ICWSM.

[54]  Alan D. Mead,et al.  The Sixteen Personality Factor Questionnaire (16PF). , 2008 .

[55]  Jianfeng Gao,et al.  Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation , 2017, IJCNLP.

[56]  Kaiming He,et al.  Exploring the Limits of Weakly Supervised Pretraining , 2018, ECCV.

[57]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Aviv Eisenschtat,et al.  Capturing Deep Correlations with 2-Way Nets , 2016, ArXiv.

[59]  Basura Fernando,et al.  SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.

[60]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[61]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.