Emotional Dialogue Generation using Image-Grounded Language Models

Computer-based conversational agents are becoming ubiquitous. However, for these systems to be engaging and valuable to the user, they must be able to express emotion in addition to providing informative responses. Humans rely on much more than language during conversations; visual information is key to providing context. We present the first example of an image-grounded conversational agent using visual sentiment, facial expression and scene features. We show that key qualities of the generated dialogue can be manipulated by the features used for training the agent. We evaluate our model on a large and very challenging real-world dataset of conversations from social media (Twitter). Image grounding leads to significantly more informative, emotional and specific responses, and the exact qualities can be tuned depending on the image features used. Furthermore, our model improves the objective quality of dialogue responses when evaluated on standard natural language metrics.
