Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Can we develop visually grounded dialog agents that efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog models that can adapt to new tasks without language-level supervision. By factorizing intention and language, our model minimizes linguistic drift after fine-tuning on new tasks. We present qualitative results, automated metrics, and human studies, all of which show that our model can adapt to new tasks while maintaining language quality. Baselines either fail to perform well on new tasks or experience language drift, becoming unintelligible to humans. Code has been made available at this https URL
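The "factorizing intention and language" idea can be illustrated with a minimal sketch: a task-specific planner selects a discrete latent "intention" code (here sampled with the Gumbel-Softmax relaxation), and a separate, frozen language decoder would map that code to question text. Only the planner is fine-tuned on a new task, which is one way drift in the language module can be limited. The names `planner_logits`, `K`, and the decoupling shown here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Sample a relaxed (near-one-hot) categorical vector via Gumbel noise."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())          # softmax with numerical stabilization
    return e / e.sum()

# Hypothetical factorization: the planner picks one of K latent intention
# codes; only these logits would be updated when adapting to a new task,
# while the decoder that realizes an intention as language stays frozen.
K = 8
planner_logits = rng.normal(size=K)          # task-specific, trainable
z = gumbel_softmax(planner_logits, tau=0.5)  # discrete-ish intention vector

print(z.shape, round(float(z.sum()), 6))
```

Lowering `tau` pushes `z` closer to a one-hot choice, which is what makes the intention effectively discrete while still allowing gradients to flow through the relaxation during training.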
