Refer, Reuse, Reduce: Grounding Subsequent References in Visual and Conversational Contexts

Dialogue participants often refer to the same entities or situations repeatedly within a conversation, which contributes to its cohesiveness. Subsequent references exploit the common ground accumulated by the interlocutors and hence have several interesting properties: they tend to be shorter and to reuse expressions that were effective in previous mentions. In this paper, we tackle the generation of first and subsequent references in visually grounded dialogue. We propose a generation model that produces referring utterances grounded in both the visual and the conversational context. To assess the referring effectiveness of its output, we also implement a reference resolution system. Our experiments and analyses show that the model produces better, more effective referring utterances than a model not grounded in the dialogue context, and that it generates subsequent references exhibiting linguistic patterns akin to those of humans.
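
The two properties of subsequent references named above, shortening and reuse, can be illustrated with a minimal sketch. It is not the paper's evaluation procedure: the function name `reuse_and_reduction`, the whitespace tokenization, the two measures, and the example utterances are all assumptions made here for illustration.

```python
def reuse_and_reduction(chain):
    """Compare each later mention in a reference chain with the earlier ones.

    `chain` is a list of utterances (strings) that refer to the same
    referent, in order of mention. Hypothetical helper for illustration;
    tokens are lowercased and split on whitespace.
    """
    tokenized = [utterance.lower().split() for utterance in chain]
    if not tokenized or not tokenized[0]:
        raise ValueError("the chain needs a non-empty first mention")

    first_length = len(tokenized[0])
    stats = []
    for i in range(1, len(tokenized)):
        # Vocabulary of all mentions that precede the current one.
        previous_vocab = {tok for utt in tokenized[:i] for tok in utt}
        current = tokenized[i]
        reused = sum(1 for tok in current if tok in previous_vocab)
        stats.append({
            # Below 1.0 means the mention is shorter than the first one.
            "length_ratio": len(current) / first_length,
            # Share of the mention's tokens already used in earlier mentions.
            "reuse_rate": reused / len(current) if current else 0.0,
        })
    return stats


# Invented example chain, not taken from any dataset.
example_chain = [
    "the woman in a red coat holding a large umbrella",
    "the woman with the red umbrella",
    "red umbrella",
]

for mention, row in zip(example_chain[1:], reuse_and_reduction(example_chain)):
    print(f"{mention!r}: length_ratio={row['length_ratio']:.2f}, "
          f"reuse_rate={row['reuse_rate']:.2f}")
```

On a chain like this one, the length ratio falls and the reuse rate rises with each mention, which is the pattern the abstract describes for human subsequent references.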
