Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

This work addresses the problem of video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer conditioned on the video content and the question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Most existing works tackle the problem with overall frame-level visual understanding, which neglects finer-grained and temporal information inside the video, or combine multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that can search for the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations between the multi-grained visual content and the question. The mutually attended features are then combined hierarchically using a double-layer LSTM to generate the answer. Furthermore, we compare several multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on the ActivityNet dataset.
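The pipeline above can be sketched in code. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: question-guided attention re-weights frame-level and region-level feature sequences, and a two-layer LSTM fuses the attended sequences hierarchically before an answer classifier. All dimensions, layer names, and the additive combination of the two granularities are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class MultiGrainTemporalAttention(nn.Module):
    """Hypothetical sketch of question-guided multi-grained temporal fusion.

    Frame-level and region-level feature sequences are each attended by the
    question, then fused by a double-layer LSTM whose final state predicts
    the answer. Dimensions and fusion choices are illustrative, not the
    paper's exact architecture.
    """

    def __init__(self, frame_dim=512, region_dim=256, q_dim=300,
                 hid=256, vocab=1000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hid)    # coarse granularity
        self.region_proj = nn.Linear(region_dim, hid)  # fine granularity
        self.q_proj = nn.Linear(q_dim, hid)
        # "double layer LSTM" for hierarchical temporal fusion
        self.fuse = nn.LSTM(hid, hid, num_layers=2, batch_first=True)
        self.answer = nn.Linear(hid, vocab)
        self.hid = hid

    def attend(self, q, feats):
        # Temporal attention: score each time step against the question,
        # then re-weight the sequence (temporal order is preserved so the
        # LSTM can still model dynamics over the attended frames).
        scores = torch.einsum('bth,bh->bt', feats, q) / math.sqrt(self.hid)
        weights = torch.softmax(scores, dim=1)          # (B, T)
        return feats * weights.unsqueeze(-1)            # (B, T, H)

    def forward(self, frames, regions, q):
        # frames:  (B, T, frame_dim)   holistic per-frame features
        # regions: (B, T, region_dim)  pooled local/object features per frame
        # q:       (B, q_dim)          question embedding
        f = torch.tanh(self.frame_proj(frames))
        r = torch.tanh(self.region_proj(regions))
        qv = torch.tanh(self.q_proj(q))
        # Mutual attention of the question over each granularity
        fused = self.attend(qv, f) + self.attend(qv, r)  # (B, T, H)
        _, (h, _) = self.fuse(fused)                     # h: (2, B, H)
        return self.answer(h[-1])                        # answer logits


model = MultiGrainTemporalAttention()
logits = model(torch.randn(2, 8, 512),   # 8 frames, holistic features
               torch.randn(2, 8, 256),   # 8 frames, local features
               torch.randn(2, 300))      # question embedding
print(logits.shape)  # torch.Size([2, 1000])
```

In this sketch the two granularities are combined by addition after attention; the paper's comparison of fusion configurations (concatenation, addition, hierarchical fusion) would correspond to swapping this step.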
