The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue

This paper introduces the PhotoBook dataset, a large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate the shared dialogue history that accumulates during conversation. Taking inspiration from seminal work on dialogue analysis, we propose a data-collection task formulated as a collaborative game prompting two online participants to refer to images using both their visual context and previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected. To further illustrate the novel features of the dataset, we propose a baseline model for reference resolution which uses a simple method to take into account shared information accumulated in a reference chain. Our results show that this information is particularly important for resolving later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.
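The abstract does not spell out the baseline's architecture, so the following is only an illustrative sketch of the general idea of conditioning reference resolution on a reference chain: the function names, the embedding-averaging encoder, the `history_weight` mixing parameter, and the dot-product scoring are all assumptions for illustration, not the paper's actual model.

```python
import numpy as np

def encode(tokens, emb):
    """Encode a referring expression by averaging its word embeddings.
    Tokens without an embedding are skipped; an all-unknown expression
    maps to the zero vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return np.zeros_like(next(iter(emb.values())))
    return np.mean(vecs, axis=0)

def resolve(current_expr, chain, candidates, emb, history_weight=0.5):
    """Pick the candidate image whose feature vector best matches a query
    that mixes the current expression with its reference chain (the
    previous descriptions of the same image in the dialogue history)."""
    query = encode(current_expr, emb)
    if chain:
        history = np.mean([encode(expr, emb) for expr in chain], axis=0)
        query = query + history_weight * history
    scores = {img: float(feat @ query) for img, feat in candidates.items()}
    return max(scores, key=scores.get)
```

With toy two-dimensional embeddings, a later, underspecified mention such as "the same one" can still be resolved correctly because the chain contributes the content of the earlier description ("brown dog"); this is the kind of effect the abstract attributes to shared information in a reference chain.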
