Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy

Generating goal-oriented questions in Visual Dialogue tasks is a challenging and longstanding problem. State-of-the-art systems have been shown to generate questions that, although grammatically correct, often lack an effective strategy and sound unnatural to humans. Inspired by the cognitive literature on information search and cross-situational word learning, we design Confirm-it, a model based on a beam search re-ranking algorithm that guides an effective goal-oriented strategy by asking questions that confirm the model’s conjecture about the referent. We take the GuessWhat?! game as a case study. We show that dialogues generated by Confirm-it are more natural and effective than those produced by beam search decoding without re-ranking.
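The re-ranking idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function `rerank_by_confirmation`, the `belief_after` callback, and the toy probabilities are all hypothetical names and values introduced here for illustration. The sketch assumes that each candidate question produced by beam search can be mapped to the probability distribution over candidate objects that the model would hold after asking it, and it moves to the front the question that gives the highest probability to the current conjecture.

```python
from typing import Callable, Dict, List, Sequence


def rerank_by_confirmation(
    candidates: Sequence[str],
    conjecture: str,
    belief_after: Callable[[str], Dict[str, float]],
) -> List[str]:
    """Order beam candidates by how strongly they confirm `conjecture`.

    `belief_after(question)` is an assumed callback returning the
    distribution over candidate objects after the question is asked
    and answered. Questions that leave more probability on the
    conjectured object come first; ties keep the original beam order.
    """
    return sorted(
        candidates,
        key=lambda q: belief_after(q).get(conjecture, 0.0),
        reverse=True,
    )


# Toy, made-up beliefs for three questions returned by beam search.
toy_beliefs = {
    "is it a person?": {"dog": 0.4, "cat": 0.4, "man": 0.2},
    "is it the dog?": {"dog": 0.9, "cat": 0.05, "man": 0.05},
    "is it on the left?": {"dog": 0.6, "cat": 0.1, "man": 0.3},
}
beam = list(toy_beliefs)  # original beam order
ranked = rerank_by_confirmation(beam, "dog", toy_beliefs.__getitem__)
print(ranked[0])
```

Plain beam search would ask the first question in `beam`; under the assumptions above, the re-ranked list instead starts with the question that most supports the conjectured object.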
