Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

GuessWhat?! is a visual dialog guessing game in which a Questioner agent generates a sequence of questions and an Oracle agent answers each question about a target object in an image. Based on the dialog history between the Questioner and the Oracle, a Guesser agent makes a final guess at the target object. While previous work has focused on dialogue policy optimization and visual-linguistic information fusion, most of it learns the visual-linguistic encoding for the three agents solely on the GuessWhat?! dataset, without shared, prior knowledge of visual-linguistic representation. To bridge this gap, this paper proposes new Oracle, Guesser, and Questioner models that take advantage of a pretrained visual-linguistic model, ViLBERT. For the Oracle, we introduce a two-way background/target fusion mechanism to understand both intra-object and inter-object questions. For the Guesser, we introduce a state estimator that best exploits ViLBERT's strength in single-turn referring-expression comprehension. For the Questioner, we share the state estimator from the pretrained Guesser to guide the question generator. Experimental results show that our proposed models outperform state-of-the-art models significantly, by 7%, 10%, and 12% for the Oracle, Guesser, and end-to-end Questioner, respectively.
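To make the setup concrete, below is a minimal PyTorch sketch of a Guesser-style state estimator as the abstract describes it: after each question/answer turn, the estimator scores every candidate object region against the current dialog encoding and maintains a belief distribution over candidates, with the final guess taken as the argmax. This is an illustrative reconstruction, not the paper's code; the module names, feature dimensions, and the simple dot-product scorer are assumptions standing in for the actual ViLBERT-based cross-modal encoder.

```python
# Hypothetical sketch of a state estimator for the Guesser. A dot-product
# matcher stands in for ViLBERT's joint encoder; dimensions are illustrative
# (768 ~ a BERT-style pooled text encoding, 2048 ~ Faster R-CNN region features).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateEstimator(nn.Module):
    def __init__(self, dialog_dim=768, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.dialog_proj = nn.Linear(dialog_dim, hidden_dim)  # dialog-history embedding
        self.region_proj = nn.Linear(region_dim, hidden_dim)  # per-object region features

    def forward(self, dialog_emb, region_feats):
        # dialog_emb:   (batch, dialog_dim)            pooled dialog representation
        # region_feats: (batch, n_objects, region_dim) candidate-object features
        d = self.dialog_proj(dialog_emb).unsqueeze(1)  # (batch, 1, hidden)
        r = self.region_proj(region_feats)             # (batch, n_objects, hidden)
        scores = (d * r).sum(-1)                       # dot-product match per object
        return F.softmax(scores, dim=-1)               # belief state over candidates

# After each turn, re-encode the dialog and refresh the belief state.
estimator = StateEstimator()
dialog_emb = torch.randn(1, 768)        # stand-in for a pooled dialog encoding
region_feats = torch.randn(1, 5, 2048)  # stand-in for 5 candidate-object features
belief = estimator(dialog_emb, region_feats)
guess = belief.argmax(dim=-1)           # index of the predicted target object
```

In the paper's design the same state estimator is shared with the Questioner; in a sketch like this, that would amount to feeding the belief distribution as an additional input that conditions the question generator.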

[1] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.

[2] Qi Wu, et al. Asking the Difficult Questions: Goal-Oriented Visual Question Generation via Intermediate Rewards, 2017, arXiv.

[3] Xinlei Chen, et al. Pythia v0.1: the Winning Entry to the VQA Challenge 2018, 2018, arXiv.

[4] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[5] Davide Belli, et al. Adding Object Detection Skills to Visual Dialogue Agents, 2018, ECCV Workshops.

[6] Qi Wu, et al. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments, 2018, CVPR.

[7] Qi Wu, et al. What's to Know? Uncertainty as a Guide to Asking Goal-Oriented Questions, 2019, CVPR.

[8] Qi Wu, et al. Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries, 2018, CVPR.

[9] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.

[10] Byoung-Tak Zhang, et al. Bilinear Attention Networks, 2018, NeurIPS.

[11] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Matthew Turk, et al. What Should I Ask? Using Conversationally Informative Rewards for Goal-oriented Visual Dialog, 2019, ACL.

[13] Wei Pang, et al. Visual Dialogue State Tracking for Question Generation, 2020, AAAI.

[14] Joelle Pineau, et al. Hierarchical Neural Network Generative Models for Movie Dialogues, 2015, arXiv.

[15] Marcus Rohrbach, et al. 12-in-1: Multi-Task Vision and Language Representation Learning, 2020, CVPR.

[16] Philippe Preux, et al. Visual Reasoning with Multi-hop Feature Modulation, 2018, ECCV.

[17] Ahmed El Kholy, et al. UNITER: Learning UNiversal Image-TExt Representations, 2020, ECCV.

[18] Alexander J. Smola, et al. Stacked Attention Networks for Image Question Answering, 2016, CVPR.

[19] Hugo Larochelle, et al. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue, 2017, CVPR.

[20] Zheng-Jun Zha, et al. Making History Matter: History-Advantage Sequence Training for Visual Dialog, 2019, ICCV.

[21] Luciana Benotti, et al. On the role of effective and referring questions in GuessWhat?!, 2020, ALVR.

[22] Volker Tresp, et al. Improving Goal-Oriented Visual Dialog Agents via Advanced Recurrent Nets with Tempered Policy Gradient, 2018, LaCATODA@IJCAI.

[23] Roozbeh Mottaghi, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks, 2020, CVPR.

[24] Tomohide Shibata. Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.

[25] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.

[26] Zhou Yu, et al. Deep Modular Co-Attention Networks for Visual Question Answering, 2019, CVPR.

[27] Raffaella Bernardi, et al. Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat, 2018, NAACL.

[28] Olivier Pietquin, et al. End-to-end optimization of goal-driven and visually grounded dialogue systems, 2017, IJCAI.

[29] Qi Wu, et al. Visual Grounding via Accumulated Attention, 2018, CVPR.

[30] Qi Wu, et al. An Active Information Seeking Model for Goal-oriented Vision-and-Language Tasks, 2018, arXiv.

[31] Lei Zheng, et al. Texygen: A Benchmarking Platform for Text Generation Models, 2018, SIGIR.

[32] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[33] Stefan Lee, et al. Evaluating Visual Conversational Agents via Cooperative Human-AI Games, 2017, HCOMP.

[34] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.

[35] Chen Huang, et al. Learning to Disambiguate by Asking Discriminative Questions, 2017, ICCV.

[36] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[37] Licheng Yu, et al. Modeling Context in Referring Expressions, 2016, ECCV.

[38] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, arXiv.

[39] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.

[40] José M. F. Moura, et al. Visual Dialog, 2017, CVPR.

[41] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[42] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.