CoDraw: Visual Dialog for Collaborative Drawing

In this work, we propose a goal-driven collaborative task whose core components are vision, language, and action in a virtual environment. Specifically, we develop a collaborative "Image Drawing" game between two agents, called CoDraw. Our game is grounded in a virtual world containing movable clip art objects and involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using the available clip art pieces. The two players communicate via two-way natural-language dialog. We collect the CoDraw dataset of ~10K dialogs, consisting of 138K messages exchanged between Tellers and Drawers on Amazon Mechanical Turk (AMT). We analyze the dataset and present three models of the players' behaviors, including an attention model that describes and draws multiple clip art pieces in each round. We quantitatively compare the attention model against the other models to assess how well conventional approaches handle this new task, and we also present qualitative visualizations.
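To make the interaction structure concrete, below is a minimal Python sketch of one possible encoding of the game loop described above. All names (ClipArt, Scene, describe, update, respond, observe) and the attribute set (position, size, flip) are illustrative assumptions for exposition, not the dataset's actual schema or the paper's implementation.

```python
# Illustrative sketch of the CoDraw interaction protocol.
# All class, method, and attribute names here are hypothetical, chosen only
# to make the task structure concrete; they do not reflect the released
# code or data format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClipArt:
    kind: str    # clip art type, e.g. "boy" or "sun"
    x: float     # horizontal canvas position
    y: float     # vertical canvas position
    size: int    # discrete depth/scale level
    flip: bool   # horizontal mirroring

@dataclass
class Scene:
    objects: List[ClipArt] = field(default_factory=list)

def run_dialog(teller, drawer, target: Scene, max_rounds: int = 20) -> Scene:
    """Run rounds of two-way communication until the Teller signals that the
    target scene has been fully described, or the round budget runs out."""
    canvas = Scene()  # the Drawer starts from an empty canvas
    for _ in range(max_rounds):
        message = teller.describe(target)  # natural-language instruction
        drawer.update(canvas, message)     # add/move/flip/resize clip arts
        reply = drawer.respond(message)    # confirmation or clarifying question
        if teller.observe(reply):          # True once the scene is covered
            break
    return canvas
```

Under this framing, the Drawer's final canvas can be compared against the Teller's target scene to measure reconstruction quality once the dialog ends, which is what makes the task goal-driven rather than open-ended conversation.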
