Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks

Video dialog is a new and challenging task, which requires an AI agent to maintain a meaningful dialog with humans in natural language about video contents. Specifically, given a video, a dialog history and a new question about the video, the agent has to combine video information with dialog history to infer the answer. And due to the complexity of video information, the methods of image dialog might be ineffectively applied directly to video dialog. In this paper, we propose a novel approach for video dialog called multi-grained convolutional self-attention context network, which combines video information with dialog history. Instead of using RNN to encode the sequence information, we design a multi-grained convolutional self-attention mechanism to capture both element and segment level interactions which contain multi-grained sequence information. Then, we design a hierarchical dialog history encoder to learn the context-aware question representation and a two-stream video encoder to learn the context-aware video representation. We evaluate our method on two large-scale datasets. Due to the flexibility and parallelism of the new attention mechanism, our method can achieve higher time efficiency, and the extensive experiments also show the effectiveness of our method.

[1]  Yueting Zhuang,et al.  Video Question Answering via Hierarchical Spatio-Temporal Attention Networks , 2017, IJCAI.

[2]  Takayuki Okatani,et al.  Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Yale Song,et al.  TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  William B. Dolan,et al.  Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.

[5]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[6]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[7]  Long Chen,et al.  Video Question Answering via Attribute-Augmented Attention Network Learning , 2017, SIGIR.

[8]  Juan Carlos Niebles,et al.  Leveraging Video Descriptions to Learn Video Question Answering , 2016, AAAI.

[9]  Yu Zhang,et al.  Training RNNs as Fast as CNNs , 2017, EMNLP 2018.

[10]  Anoop Cherian,et al.  End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Stefan Lee,et al.  Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Mario Fritz,et al.  Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[15]  Liangliang Cao,et al.  Focal Visual-Text Attention for Visual Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Bowen Zhou,et al.  A Structured Self-attentive Sentence Embedding , 2017, ICLR.

[18]  Hugo Larochelle,et al.  GuessWhat?! Visual Object Discovery through Multi-modal Dialogue , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[20]  Byoung-Tak Zhang,et al.  Multimodal Residual Learning for Visual QA , 2016, NIPS.

[21]  Bohyung Han,et al.  Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Svetlana Lazebnik,et al.  Two Can Play This Game: Visual Dialog with Discriminative Question Generation and Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[25]  Yu-Jung Heo,et al.  Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog , 2018, NeurIPS.

[26]  Jun Xiao,et al.  Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network , 2018, IJCAI.

[27]  Joelle Pineau,et al.  A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues , 2016, AAAI.

[28]  Zhoujun Li,et al.  Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots , 2016, ArXiv.

[29]  José M. F. Moura,et al.  Visual Coreference Resolution in Visual Dialog using Neural Module Networks , 2018, ECCV.

[30]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[31]  Philip H. S. Torr,et al.  FLIPDIAL: A Generative Model for Two-Way Visual Dialogue , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Bernt Schiele,et al.  Coherent Multi-sentence Video Description with Variable Level of Detail , 2014, GCPR.

[33]  Ramakant Nevatia,et al.  Motion-Appearance Co-memory Networks for Video Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[35]  Jason Weston,et al.  Dialog-based Language Learning , 2016, NIPS.

[36]  Andrew Crane,et al.  Are You Talking to Me? Stakeholder Communication and the Risks and Rewards of Dialogue , 2003 .

[37]  José M. F. Moura,et al.  Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Bohyung Han,et al.  Visual Reference Resolution using Attention Memory for Visual Dialog , 2017, NIPS.

[39]  Qi Wu,et al.  Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Xiaolong Wang,et al.  Answer Sequence Learning with Neural Networks for Answer Selection in Community Question Answering , 2015, ACL.

[42]  Tao Mei,et al.  Multi-level Attention Networks for Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Bowen Zhou,et al.  Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation , 2016, AAAI.

[44]  Yann Dauphin,et al.  Language Modeling with Gated Convolutional Networks , 2016, ICML.

[45]  Jiasen Lu,et al.  Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model , 2017, NIPS.

[46]  Joelle Pineau,et al.  Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models , 2015, AAAI.

[47]  Gunhee Kim,et al.  A Read-Write Memory Network for Movie Story Understanding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Mario Fritz,et al.  A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.

[49]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.