Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks

Open-ended video question answering aims to automatically generate a natural-language answer from the referenced video content for a given question. Most existing approaches focus on short-form video question answering with multi-modal recurrent encoder-decoder networks. Although these works achieve promising performance, they remain ill-suited to long-form video question answering because they lack long-range dependency modeling and suffer from heavy computational cost. To tackle these problems, we propose a fast Hierarchical Convolutional Self-Attention encoder-decoder network (HCSA). Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video content: it builds a hierarchical structure over the video sequence and captures question-aware long-range dependencies from the video context. We then devise a multi-scale attentive decoder that incorporates multi-layer video representations for answer generation, avoiding the information loss incurred by relying only on the top encoder layer. Extensive experiments show the effectiveness and efficiency of our method.
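The encoder-decoder design above lends itself to a compact sketch. The following is a minimal, illustrative PyTorch sketch of the hierarchical convolutional self-attention idea: each layer convolves over the frame sequence, applies question-guided self-attention, and downsamples, and every layer's output is kept so a multi-scale decoder could attend over all of them. The layer sizes, the additive question conditioning, and all module names are assumptions for illustration, not the paper's exact architecture.

# Minimal illustrative sketch (PyTorch) of a hierarchical convolutional
# self-attention encoder; all sizes and names are assumptions, not the
# paper's exact architecture.
import torch
import torch.nn as nn

class ConvSelfAttnLayer(nn.Module):
    def __init__(self, dim, kernel_size=3, heads=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pool = nn.MaxPool1d(2)  # halve the sequence length per layer

    def forward(self, x, q):
        # x: (batch, frames, dim) video features; q: (batch, 1, dim) question vector
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        # question-aware self-attention: queries are shifted by the question
        # vector (one simple conditioning choice among many)
        attn_out, _ = self.attn(h + q, h, h)
        h = self.norm(h + attn_out)
        return self.pool(h.transpose(1, 2)).transpose(1, 2)

class HCSAEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [ConvSelfAttnLayer(dim) for _ in range(num_layers)])

    def forward(self, frames, question):
        scales = []  # keep every layer's output for a multi-scale decoder
        h = frames
        for layer in self.layers:
            h = layer(h, question)
            scales.append(h)
        return scales

# 512 frames of 256-d features for a batch of 2, plus a 256-d question vector.
frames, question = torch.randn(2, 512, 256), torch.randn(2, 1, 256)
print([s.shape for s in HCSAEncoder()(frames, question)])
# sequence length halves at each layer: 256, 128, 64 frames

The downsampling between layers is what keeps cost manageable on long videos: self-attention at each level runs over a sequence half the length of the one below it, while the retained per-layer outputs give a decoder access to both fine-grained and coarse temporal detail.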
