Video Story Question Answering with Character-Centric Scene Parsing and Question-Aware Temporal Attention

With the explosive growth of video content, there is increasing interest in automatic video understanding. Video Story Question Answering (VSQA) has proven to be an effective way to benchmark a model's comprehension ability. Recent VSQA approaches merely extract visual features from whole scenes or detected objects in each frame. However, it is hard to claim that a method truly understands a video without considering the characters in it. Additionally, the relations and actions obtained through scene parsing are indispensable for comprehending video stories. In this work, we incorporate character-centric scene parsing to assist the VSQA task. Our reasoning framework consists of two parts: the first utilizes question-aware temporal attention to locate the corresponding frame intervals; the second involves a cross-attention transformer for multi-stream fusion. We train and test our VSQA model on the recently released TVQA dataset, which is currently the largest VSQA dataset. Experiments show that all modules in our framework work collaboratively and significantly improve overall performance. Ablation studies demonstrate that our scene-parsing-based framework is effective for a deeper understanding of video semantics.
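
To make the question-aware temporal attention concrete, the PyTorch sketch below scores each frame feature against a pooled question encoding and softmax-normalizes the scores over time to weight the frames. The module name, dimensions, and additive scoring form are illustrative assumptions under a generic attention formulation, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAwareTemporalAttention(nn.Module):
    """Minimal sketch: attend over frame features conditioned on the question."""

    def __init__(self, frame_dim: int, question_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, frames: torch.Tensor, question: torch.Tensor):
        # frames:   (batch, num_frames, frame_dim) per-frame visual features
        # question: (batch, question_dim) pooled question encoding (e.g. averaged GloVe)
        q = self.question_proj(question).unsqueeze(1)        # (B, 1, H)
        f = self.frame_proj(frames)                          # (B, T, H)
        scores = self.score(torch.tanh(f + q)).squeeze(-1)   # (B, T) per-frame relevance
        weights = F.softmax(scores, dim=-1)                  # attention over time
        attended = torch.bmm(weights.unsqueeze(1), frames)   # (B, 1, frame_dim)
        return attended.squeeze(1), weights

if __name__ == "__main__":
    # Hypothetical dimensions for illustration only.
    attn = QuestionAwareTemporalAttention(frame_dim=2048, question_dim=300)
    frames = torch.randn(2, 20, 2048)   # 20 frames of pooled visual features
    question = torch.randn(2, 300)      # pooled question embedding
    ctx, w = attn(frames, question)
    print(ctx.shape, w.shape)           # torch.Size([2, 2048]) torch.Size([2, 20])
```

In a full pipeline, the attention weights could also be thresholded to select a contiguous frame interval, which is the localization role this module plays in the framework described above.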
