论文信息 - Visual Dialog

Visual Dialog

We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial contains 1 dialog (10 question-answer pairs) on ~140k images from the COCO dataset, with a total of ~1.4M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders (Late Fusion, Hierarchical Recurrent Encoder and Memory Network) and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response. We quantify gap between machine and human performance on the Visual Dialog task via human studies. Our dataset, code, and trained models will be released publicly at https://visualdialog.org. Putting it all together, we demonstrate the first visual chatbot!.

[1] Tim Paek. Empirical Methods for Evaluating Dialog Systems , 2001, SIGDIAL Workshop.

[2] Kallirroi Georgila,et al. An ISU Dialogue System Exhibiting Reinforcement Learning of Dialogue Policies: Generic Slot-Filling in the TALK In-car System , 2006, EACL.

[3] Rob Miller,et al. VizWiz: nearly real-time answers to visual questions , 2010, UIST.

[4] Jeffrey P. Bigham,et al. VizWiz: nearly real-time answers to visual questions , 2010, W4A.

[5] Cristian Danescu-Niculescu-Mizil,et al. Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs , 2011, CMCL@ACL.

[6] P. Cochat,et al. Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[7] Sanja Fidler,et al. What Are You Talking About? Text-to-Image Coreference , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8] Kewei Tu,et al. Joint Video and Text Parsing for Understanding Events and Answering Queries , 2013, IEEE MultiMedia.

[9] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[10] Mario Fritz,et al. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.

[11] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[12] Donald Geman,et al. Visual Turing test for computer vision systems , 2015, Proceedings of the National Academy of Sciences.

[13] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Wei Xu,et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question , 2015, NIPS.

[16] Richard S. Zemel,et al. Exploring Models and Data for Image Question Answering , 2015, NIPS.

[17] Jason Weston,et al. Large-scale Simple Question Answering with Memory Networks , 2015, ArXiv.

[18] Quoc V. Le,et al. A Neural Conversational Model , 2015, ArXiv.

[19] Joelle Pineau,et al. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems , 2015, SIGDIAL Conference.

[20] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[21] Licheng Yu,et al. Visual Madlibs: Fill in the blank Image Generation and Question Answering , 2015, ArXiv.

[22] Bernt Schiele,et al. A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24] Mario Fritz,et al. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25] Phil Blunsom,et al. Teaching Machines to Read and Comprehend , 2015, NIPS.

[26] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.

[28] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[30] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[31] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Dhruv Batra,et al. Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.

[33] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.

[34] Gordon Christie,et al. Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions , 2016, EMNLP.

[35] Jianfeng Gao,et al. Deep Reinforcement Learning for Dialogue Generation , 2016, EMNLP.

[36] Joelle Pineau,et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation , 2016, EMNLP.

[37] Trevor Darrell,et al. Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[38] Joelle Pineau,et al. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models , 2015, AAAI.

[39] Sanja Fidler,et al. MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Dan Klein,et al. Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Gordon Christie,et al. Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes , 2016, EMNLP.

[42] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Allan Jabri,et al. Revisiting Visual Question Answering Baselines , 2016, ECCV.

[45] Dhruv Batra,et al. Sort Story: Sorting Jumbled Images and Captions into Stories , 2016, EMNLP.

[46] Wei Liu,et al. SSD: Single Shot MultiBox Detector , 2015, ECCV.

[47] Dhruv Batra,et al. Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.

[48] Yash Goyal,et al. Yin and Yang: Balancing and Answering Binary Visual Questions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Dan Klein,et al. Learning to Compose Neural Networks for Question Answering , 2016, NAACL.

[50] Matthew R. Walter,et al. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences , 2015, AAAI.

[51] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[52] Dhruv Batra,et al. Analyzing the Behavior of Visual Question Answering Models , 2016, EMNLP.

[53] Francis Ferraro,et al. Visual Storytelling , 2016, NAACL.

[54] Peter Young,et al. Smart Reply: Automated Response Suggestion for Email , 2016, KDD.

[55] Jason Weston,et al. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.

[56] Trevor Darrell,et al. Segmentation from Natural Language Expressions , 2016, ECCV.

[57] Xiang Zhang,et al. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems , 2015, ICLR.

[58] Dhruv Batra,et al. Measuring Machine Intelligence Through Visual Question Answering , 2016, AI Mag..

[59] Yoshua Bengio,et al. Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus , 2016, ACL.

[60] Michael S. Bernstein,et al. Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[62] Stefan Lee,et al. The Promise of Premise: Harnessing Question Premises in Visual Question Answering , 2017, EMNLP.

[63] Subhashini Venugopalan,et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[64] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65] Joelle Pineau,et al. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues , 2016, AAAI.

[66] Stefan Lee,et al. Evaluating Visual Conversational Agents via Cooperative Human-AI Games , 2017, HCOMP.

[67] Limin Wang,et al. Knowledge Guided Disambiguation for Large-Scale Scene Classification With Multi-Resolution CNNs , 2016, IEEE Transactions on Image Processing.

[68] Jiasen Lu,et al. Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model , 2017, NIPS.

[69] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Yale Song,et al. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[71] Hugo Larochelle,et al. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72] Stefan Lee,et al. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[73] Demis Hassabis,et al. Mastering the game of Go without human knowledge , 2017, Nature.

[74] Jianfeng Gao,et al. Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation , 2017, IJCNLP.

[75] Olivier Pietquin,et al. End-to-end optimization of goal-driven and visually grounded dialogue systems , 2017, IJCAI.

[76] S. Moss. Listen , 2017 .

[77] Jason Weston,et al. Learning End-to-End Goal-Oriented Dialog , 2016, ICLR.

[78] Philip H. S. Torr,et al. FLIPDIAL: A Generative Model for Two-Way Visual Dialogue , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.