Long-Term Video Question Answering via Multimodal Hierarchical Memory Attentive Networks

Long-term Video Question Answering plays an essential role in visual information retrieval: it aims to generate natural language answers to arbitrary free-form questions about a referenced long-term video. Rather than remembering a video as a flat sequence of visual content, humans have an innate cognitive ability to identify the critical moments related to a question at first glance, and then tie together the specific evidence around these moments for further analysis and reasoning. Motivated by this intuition, we propose multimodal hierarchical memory attentive networks with two heterogeneous memory subnetworks: a top guided memory network and a bottom enhanced multimodal memory attentive network. The top guided memory network serves as a shallow inference engine that picks out question-relevant, informative moments and summarizes salient video content at a coarse-grained level. The bottom enhanced multimodal memory attentive network then acts as an in-depth reasoning engine that performs more accurate attention, using cues from low-level video evidence at a fine-grained level, to improve answer quality. We evaluate the proposed method on three publicly available video question answering benchmarks: ActivityNet-QA, MSRVTT-QA, and MSVD-QA. Experimental results demonstrate that the proposed approach significantly outperforms other state-of-the-art methods on long-term videos, and extensive ablation studies explore the reasons behind the proposed model’s effectiveness.
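
To make the two-stage design concrete, the following is a minimal sketch of a coarse-to-fine hierarchical attention module of the kind the abstract describes: a top (coarse) stage scores clip-level features against the question and keeps the top-k "critical moments", and a bottom (fine) stage re-attends over the frames inside those moments. All names, shapes, and the top-k selection strategy here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical coarse-to-fine hierarchical attention sketch (PyTorch).
# Shapes, module names, and top-k selection are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalMemoryAttention(nn.Module):
    def __init__(self, dim: int, top_k: int = 3):
        super().__init__()
        self.top_k = top_k
        # Coarse (top) stage: scores each clip against the question.
        self.coarse_proj = nn.Linear(2 * dim, dim)
        self.coarse_score = nn.Linear(dim, 1)
        # Fine (bottom) stage: re-weights frames inside selected clips.
        self.fine_proj = nn.Linear(2 * dim, dim)
        self.fine_score = nn.Linear(dim, 1)

    def forward(self, clip_feats, frame_feats, question):
        # clip_feats:  (B, C, D)    one feature vector per clip/moment
        # frame_feats: (B, C, F, D) frame vectors grouped by clip
        # question:    (B, D)       pooled question embedding
        B, C, D = clip_feats.shape

        # --- Top stage: coarse question-to-clip attention ---
        q = question.unsqueeze(1).expand(-1, C, -1)            # (B, C, D)
        coarse_logits = self.coarse_score(
            torch.tanh(self.coarse_proj(torch.cat([clip_feats, q], -1)))
        ).squeeze(-1)                                          # (B, C)
        # Keep only the top-k "critical moments" at a coarse level.
        topk_logits, topk_idx = coarse_logits.topk(self.top_k, dim=1)
        clip_weights = F.softmax(topk_logits, dim=1)           # (B, k)

        # Gather the frames belonging to the selected clips.
        Fr = frame_feats.size(2)
        idx = topk_idx.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, Fr, D)
        sel_frames = frame_feats.gather(1, idx)                # (B, k, F, D)

        # --- Bottom stage: fine frame-level attention within each clip ---
        qf = question.view(B, 1, 1, D).expand(-1, self.top_k, Fr, -1)
        fine_logits = self.fine_score(
            torch.tanh(self.fine_proj(torch.cat([sel_frames, qf], -1)))
        ).squeeze(-1)                                          # (B, k, F)
        frame_weights = F.softmax(fine_logits, dim=-1)
        clip_summary = (frame_weights.unsqueeze(-1) * sel_frames).sum(2)

        # Combine coarse clip weights and fine frame evidence into one
        # answer-ready representation.
        return (clip_weights.unsqueeze(-1) * clip_summary).sum(1)  # (B, D)
```

Under these assumptions, the coarse stage plays the role of the shallow inference engine (selecting moments), while the fine stage plays the role of the in-depth reasoning engine (attending over bottom-level evidence within those moments); a hard top-k cut is just one simple way to realize the coarse selection.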
