Video Dialog via Progressive Inference and Cross-Transformer

Video dialog is a new and challenging task that requires an agent to answer questions by combining video information with the dialog history. Unlike single-turn video question answering, video dialog must additionally exploit the dialog history, which often carries contextual information essential to the current question. Existing visual dialog methods mainly use RNNs to encode the dialog history into a single vector representation, which is coarse and discards structure. More advanced methods employ hierarchical structures, attention, and memory mechanisms, but they still lack an explicit reasoning process. In this paper, we introduce a novel progressive inference mechanism for video dialog, which progressively updates the query representation based on the dialog history and video content until the agent deems the information sufficient and unambiguous. To tackle the multi-modal fusion problem, we propose a cross-transformer module that learns fine-grained and comprehensive interactions both within and between the modalities. Beyond answer generation, we also consider question generation, which is more challenging yet essential for a complete video dialog system. We evaluate our method on two large-scale datasets, and extensive experiments demonstrate its effectiveness.
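The abstract does not specify the internals of the cross-transformer module, but a common way to realize "interactions both within and between modalities" is to combine intra-modal self-attention with inter-modal cross-attention. The sketch below is a minimal, hypothetical illustration of that pattern (all function names are ours, not the paper's), using plain scaled dot-product attention without learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_transformer_block(video_feats, text_feats):
    """Hypothetical sketch of one cross-transformer block.

    video_feats: (num_frames, d) video features
    text_feats:  (num_tokens, d) question/history features
    """
    # intra-modal interactions: each modality attends to itself
    video_self = attention(video_feats, video_feats, video_feats)
    text_self = attention(text_feats, text_feats, text_feats)
    # inter-modal interactions: each modality attends to the other
    video_out = attention(video_self, text_self, text_self)
    text_out = attention(text_self, video_self, video_self)
    return video_out, text_out

video = np.random.randn(8, 16)   # 8 frame features of dimension 16
text = np.random.randn(5, 16)    # 5 token features of dimension 16
v_out, t_out = cross_transformer_block(video, text)
```

Each modality's output retains its own sequence length while being contextualized by the other modality, which is what makes such a block a natural building point for the progressive query-update loop the abstract describes.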
