Reasoning about Pragmatics with Neural Listeners and Speakers

We present a model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics. Like previous learned approaches to language generation, our model uses a simple feature-driven architecture (here a pair of neural "listener" and "speaker" models) to ground language in the world. Like inference-driven approaches to pragmatics, our model actively reasons about listener behavior when selecting utterances. For training, our approach requires only ordinary captions, annotated _without_ demonstration of the pragmatic behavior the model ultimately exhibits. In human evaluations on a referring expression game, our approach succeeds 81% of the time, compared to a 69% success rate using existing techniques.
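The pragmatic selection step described above is, at its core, a sample-and-rerank procedure: the speaker model proposes candidate descriptions, and the listener model scores how reliably each one would pick out the target scene over a distractor. Below is a minimal sketch of that idea, not the paper's implementation; the names `pragmatic_describe`, `sample_speaker`, and `listener_prob`, and the mixing weight `lam`, are illustrative assumptions.

```python
import math
import random

def pragmatic_describe(target, distractor, sample_speaker, listener_prob,
                       n_samples=10, lam=0.5):
    """Sample candidate utterances from the speaker, then rerank by a
    weighted combination of listener accuracy and speaker fluency.
    (Hypothetical interface; the paper's models are neural networks.)"""
    best, best_score = None, -math.inf
    for _ in range(n_samples):
        utterance, speaker_logp = sample_speaker(target)
        # Listener term: probability the listener resolves the utterance
        # to the target rather than the distractor.
        p_t = listener_prob(utterance, target)
        p_d = listener_prob(utterance, distractor)
        listener_logp = math.log(p_t / (p_t + p_d + 1e-9) + 1e-9)
        # Interpolate discriminativeness against fluency.
        score = lam * listener_logp + (1 - lam) * speaker_logp
        if score > best_score:
            best, best_score = utterance, score
    return best

# Toy usage with stand-in models, purely for illustration.
if __name__ == "__main__":
    vocab = ["the square", "the red square", "the square on the left"]
    def sample_speaker(scene):
        u = random.choice(vocab)
        return u, math.log(1.0 / len(vocab))
    def listener_prob(utterance, scene):
        # Pretend more specific utterances identify the target better.
        return 0.9 if "left" in utterance and scene == "target" else 0.5
    print(pragmatic_describe("target", "distractor",
                             sample_speaker, listener_prob))
```

Trading the listener term off against the speaker term is what keeps the chosen utterance both discriminative (it distinguishes the target) and fluent (it still looks like a natural caption), which matches the division of labor between the two models in the abstract.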
