Paper Information - Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

Vishvak S. Murahari | Dhruv Batra | Devi Parikh | Abhishek Das

[1] L. Carin,et al. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[3] Hanwang Zhang,et al. Two Causal Principles for Improving Visual Dialog , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Xingxing Zhang,et al. DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue , 2019, AAAI.

[5] Jianfeng Gao,et al. DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation , 2019, ACL.

[6] Omer Levy,et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.

[7] Peter J. Liu,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res.

[8] Kevin Gimpel,et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.

[9] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.

[10] Ahmed El Kholy,et al. UNITER: Learning UNiversal Image-TExt Representations , 2019, ECCV 2020.

[11] Abhishek Das,et al. Improving Generative Visual Dialog by Answering Diverse Questions , 2019, EMNLP.

[12] Hal Daumé,et al. Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning , 2019, EMNLP.

[13] Furu Wei,et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.

[14] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.

[15] Nan Duan,et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.

[16] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.

[17] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[18] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[19] Jesse Thomason,et al. Vision-and-Dialog Navigation , 2019, CoRL.

[20] Yiming Yang,et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.

[21] Song-Chun Zhu,et al. Reasoning Visual Dialogs With Structural and Partial Observations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Tamir Hazan,et al. Factor Graph Attention , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24] José M. F. Moura,et al. CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog , 2019, NAACL.

[25] Dacheng Tao,et al. Image-Question-Answer Synergistic Network for Visual Dialog , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Zheng-Jun Zha,et al. Making History Matter: Gold-Critic Sequence Training for Visual Dialog , 2019, ArXiv.

[27] Zheng-Jun Zha,et al. Making History Matter: History-Advantage Sequence Training for Visual Dialog , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28] Jung-Woo Ha,et al. Large-Scale Answerer in Questioner's Mind for Visual Dialog Question Generation , 2019, ICLR.

[29] Yu Cheng,et al. Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog , 2019, ACL.

[30] Anoop Cherian,et al. Audio Visual Scene-Aware Dialog , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Thomas Wolf,et al. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents , 2019, ArXiv.

[32] Asim Kadav,et al. Visual Entailment: A Novel Task for Fine-Grained Image Understanding , 2019, ArXiv.

[33] Zhiwu Lu,et al. Recursive Visual Attention in Visual Dialog , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Ali Farhadi,et al. From Recognition to Cognition: Visual Commonsense Reasoning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Yoav Artzi,et al. A Corpus for Reasoning about Natural Language Grounded in Photographs , 2018, ACL.

[36] Raffaella Bernardi,et al. Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat , 2018, NAACL.

[37] José M. F. Moura,et al. Visual Coreference Resolution in Visual Dialog using Neural Module Networks , 2018, ECCV.

[38] Jason Weston,et al. Talk the Walk: Navigating New York City through Grounded Dialogue , 2018, ArXiv.

[39] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.

[40] Samuel R. Bowman,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.

[41] Svetlana Lazebnik,et al. Two Can Play This Game: Visual Dialog with Discriminative Question Generation and Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] Xi Chen,et al. Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[43] Philip H. S. Torr,et al. FLIPDIAL: A Generative Model for Two-Way Visual Dialogue , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44] Qi Wu,et al. Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[47] Jiasen Lu,et al. Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model , 2017, NIPS.

[48] Stefan Lee,et al. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49] Olivier Pietquin,et al. End-to-end optimization of goal-driven and visually grounded dialogue systems , 2017, IJCAI.

[50] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Hugo Larochelle,et al. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[53] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54] Sanja Fidler,et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[55] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[57] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[58] Vicente Ordonez,et al. ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.

[59] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[60] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[61] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[62] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[63] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[64] Byoung-Tak Zhang,et al. Dual Attention Networks for Visual Reference Resolution in Visual Dialog , 2019, EMNLP.

[65] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.