Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks

Multi-turn video question answering is a challenging task in visual information retrieval: given a video, the preceding conversation context, and a question, the system must generate an accurate answer grounded in the video contents. Existing visual question answering methods mainly address single-turn video question answering and may be ineffective when applied directly to the multi-turn setting, because they do not model the sequential conversation context. In this paper, we study multi-turn video question answering from the viewpoint of multi-stream hierarchical attention context reinforced network learning. We first propose a hierarchical attention context network for context-aware question understanding, which models the hierarchical sequential structure of the conversation context. We then develop a multi-stream spatio-temporal attention network that learns a joint representation of the dynamic video contents and the context-aware question embedding. Next, we devise a multi-step reasoning process to enhance the multi-stream hierarchical attention context network. Finally, we predict the multiple-choice answer from a candidate answer set, and we further develop a reinforced decoder network to generate open-ended natural language answers. We construct two large-scale multi-turn video question answering datasets, and extensive experiments show the effectiveness of our method.
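
As a rough illustration of the hierarchical attention idea described above, the sketch below applies attention twice: first over the words inside each conversation turn, then over the resulting turn vectors, both guided by the current question. It is not the network proposed in the paper. Plain dot-product attention stands in for the learned attention layers, the context-aware question is formed by simple addition, and the names `attend` and `hierarchical_context` as well as the toy vectors are assumptions made only for this example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, vectors):
    # Dot-product attention (an assumption, not the paper's learned attention).
    # Returns the attention weights and the weighted sum of the vectors.
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return weights, pooled

def hierarchical_context(question, turns):
    # turns: list of conversation turns, each a list of word vectors.
    # Level 1: word-level attention inside each turn, guided by the question.
    turn_vectors = [attend(question, words)[1] for words in turns]
    # Level 2: turn-level attention across the conversation history.
    turn_weights, context = attend(question, turn_vectors)
    # Context-aware question: question plus attended context (hypothetical choice).
    aware = [q + c for q, c in zip(question, context)]
    return turn_weights, aware

# Toy data: a 2-d question vector and two earlier turns of two words each.
question = [1.0, 0.0]
turns = [
    [[1.0, 0.0], [0.9, 0.1]],  # turn whose words resemble the question
    [[0.0, 1.0], [0.1, 0.9]],  # turn whose words do not
]
turn_weights, aware = hierarchical_context(question, turns)
print(turn_weights, aware)
```

In this toy setup the first turn receives the larger turn-level weight because its words align with the question vector. A trained model would replace the fixed dot products with learned parameters and recurrent encoders.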
