A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses

In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial for understanding their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents’ verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and that producing non-verbal responses is, like producing verbal responses, a challenging task. Our dataset is publicly available.
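
To make the four annotated components concrete, a single sample could be represented roughly as follows. This is only a minimal sketch: the class name `VFDRecord`, every field name, the normalized gaze coordinates, and the helper `gaze_in_pixels` are assumptions made for illustration and are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VFDRecord:
    """One hypothetical sample; names are illustrative, not the released schema."""

    image_path: str                     # (1) first-person image seen by the agent
    utterance: str                      # (2) utterance of the human speaker
    gaze_xy: Tuple[float, float]        # (3) speaker's eye-gaze location, assumed normalized to [0, 1]
    verbal_response: Optional[str]      # (4) agent's verbal response, if any
    nonverbal_response: Optional[str]   # (4) agent's non-verbal response, if any

    def gaze_in_pixels(self, width: int, height: int) -> Tuple[int, int]:
        """Convert the assumed normalized gaze location to pixel coordinates."""
        x, y = self.gaze_xy
        return round(x * width), round(y * height)


# Invented placeholder values, used only to show how the fields fit together.
sample = VFDRecord(
    image_path="example.jpg",
    utterance="Could you pass me that?",
    gaze_xy=(0.5, 0.25),
    verbal_response="Sure, here you go.",
    nonverbal_response=None,
)
```

Keeping the gaze location alongside the image and the utterance reflects the abstract's point that the speaker's attention helps disambiguate intent, for example in resolving what "that" refers to.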
