Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

GuessWhat?! is a visual dialog guessing game in which a Questioner agent generates a sequence of questions and an Oracle agent answers each question about a target object in an image. Based on the dialog history between the Questioner and the Oracle, a Guesser agent makes a final guess at the target object. While previous work has focused on dialogue policy optimization and visual-linguistic information fusion, most of it learns the visual-linguistic encoding for the three agents solely on the GuessWhat?! dataset, without shared, prior knowledge of visual-linguistic representation. To bridge this gap, this paper proposes new Oracle, Guesser, and Questioner models that take advantage of a pretrained visual-linguistic model, ViLBERT. For the Oracle, we introduce a two-way background/target fusion mechanism to understand both intra-object and inter-object questions. For the Guesser, we introduce a state estimator that best exploits ViLBERT's strength in single-turn referring-expression comprehension. For the Questioner, we share the state estimator from the pretrained Guesser to guide the question generator. Experimental results show that our proposed models outperform state-of-the-art models significantly, by 7%, 10%, and 12% for the Oracle, Guesser, and end-to-end Questioner, respectively.
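To make the setup concrete, below is a minimal PyTorch sketch of a Guesser-style state estimator as the abstract describes it: after each question/answer turn, the estimator scores every candidate object region against the current dialog encoding and maintains a belief distribution over candidates, with the final guess taken as the argmax. This is an illustrative reconstruction, not the paper's code; the module names, feature dimensions, and the simple dot-product scorer are assumptions standing in for the actual ViLBERT-based cross-modal encoder.

```python
# Hypothetical sketch of a state estimator for the Guesser. A dot-product
# matcher stands in for ViLBERT's joint encoder; dimensions are illustrative
# (768 ~ a BERT-style pooled text encoding, 2048 ~ Faster R-CNN region features).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateEstimator(nn.Module):
    def __init__(self, dialog_dim=768, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.dialog_proj = nn.Linear(dialog_dim, hidden_dim)  # dialog-history embedding
        self.region_proj = nn.Linear(region_dim, hidden_dim)  # per-object region features

    def forward(self, dialog_emb, region_feats):
        # dialog_emb:   (batch, dialog_dim)            pooled dialog representation
        # region_feats: (batch, n_objects, region_dim) candidate-object features
        d = self.dialog_proj(dialog_emb).unsqueeze(1)  # (batch, 1, hidden)
        r = self.region_proj(region_feats)             # (batch, n_objects, hidden)
        scores = (d * r).sum(-1)                       # dot-product match per object
        return F.softmax(scores, dim=-1)               # belief state over candidates

# After each turn, re-encode the dialog and refresh the belief state.
estimator = StateEstimator()
dialog_emb = torch.randn(1, 768)        # stand-in for a pooled dialog encoding
region_feats = torch.randn(1, 5, 2048)  # stand-in for 5 candidate-object features
belief = estimator(dialog_emb, region_feats)
guess = belief.argmax(dim=-1)           # index of the predicted target object
```

In the paper's design the same state estimator is shared with the Questioner; in a sketch like this, that would amount to feeding the belief distribution as an additional input that conditions the question generator.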

[1] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.

[2] Qi Wu, et al. Asking the Difficult Questions: Goal-Oriented Visual Question Generation via Intermediate Rewards, 2017, arXiv.

[3] Xinlei Chen, et al. Pythia v0.1: the Winning Entry to the VQA Challenge 2018, 2018, arXiv.

[4] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[5] Davide Belli, et al. Adding Object Detection Skills to Visual Dialogue Agents, 2018, ECCV Workshops.

[6] Qi Wu, et al. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments, 2018, CVPR.

[7] Qi Wu, et al. What's to Know? Uncertainty as a Guide to Asking Goal-Oriented Questions, 2019, CVPR.

[8] Qi Wu, et al. Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries, 2018, CVPR.

[9] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.

[10] Byoung-Tak Zhang, et al. Bilinear Attention Networks, 2018, NeurIPS.

[11] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Matthew Turk, et al. What Should I Ask? Using Conversationally Informative Rewards for Goal-oriented Visual Dialog, 2019, ACL.

[13] Wei Pang, et al. Visual Dialogue State Tracking for Question Generation, 2020, AAAI.

[14] Joelle Pineau, et al. Hierarchical Neural Network Generative Models for Movie Dialogues, 2015, arXiv.

[15] Marcus Rohrbach, et al. 12-in-1: Multi-Task Vision and Language Representation Learning, 2020, CVPR.

[16] Philippe Preux, et al. Visual Reasoning with Multi-hop Feature Modulation, 2018, ECCV.

[17] Ahmed El Kholy, et al. UNITER: Learning UNiversal Image-TExt Representations, 2020, ECCV.

[18] Alexander J. Smola, et al. Stacked Attention Networks for Image Question Answering, 2016, CVPR.

[19] Hugo Larochelle, et al. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue, 2017, CVPR.

[20] Zheng-Jun Zha, et al. Making History Matter: History-Advantage Sequence Training for Visual Dialog, 2019, ICCV.

[21] Luciana Benotti, et al. On the role of effective and referring questions in GuessWhat?!, 2020, ALVR.

[22] Volker Tresp, et al. Improving Goal-Oriented Visual Dialog Agents via Advanced Recurrent Nets with Tempered Policy Gradient, 2018, LaCATODA@IJCAI.

[23] Roozbeh Mottaghi, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks, 2020, CVPR.

[24] Tomohide Shibata. Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.

[25] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.

[26] Zhou Yu, et al. Deep Modular Co-Attention Networks for Visual Question Answering, 2019, CVPR.

[27] Raffaella Bernardi, et al. Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat, 2018, NAACL.

[28] Olivier Pietquin, et al. End-to-end optimization of goal-driven and visually grounded dialogue systems, 2017, IJCAI.

[29] Qi Wu, et al. Visual Grounding via Accumulated Attention, 2018, CVPR.

[30] Qi Wu, et al. An Active Information Seeking Model for Goal-oriented Vision-and-Language Tasks, 2018, arXiv.

[31] Lei Zheng, et al. Texygen: A Benchmarking Platform for Text Generation Models, 2018, SIGIR.

[32] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[33] Stefan Lee, et al. Evaluating Visual Conversational Agents via Cooperative Human-AI Games, 2017, HCOMP.

[34] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.

[35] Chen Huang, et al. Learning to Disambiguate by Asking Discriminative Questions, 2017, ICCV.

[36] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[37] Licheng Yu, et al. Modeling Context in Referring Expressions, 2016, ECCV.

[38] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, arXiv.

[39] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.

[40] José M. F. Moura, et al. Visual Dialog, 2017, CVPR.

[41] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[42] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.