Recent Advances in Video Question Answering: A Review of Datasets and Methods

Video Question Answering (VQA) is a recently emerged, challenging task in the field of Computer Vision. Several visual information retrieval tasks, such as Video Captioning/Description and Video-guided Machine Translation, preceded VQA. VQA requires retrieving temporal and spatial information from video scenes and interpreting it. In this survey, we review a number of methods and datasets for the VQA task. To the best of our knowledge, no previous survey has been conducted for this task.
