Structured Two-Stream Attention Network for Video Question Answering

To date, visual question answering (VQA), covering both image QA and video QA, remains a holy grail of vision-and-language understanding, and video QA in particular. Whereas image QA focuses primarily on the associations between region-level image details and the corresponding question, video QA requires a model to reason jointly over the spatial and long-range temporal structures of a video, together with the text, to produce an accurate answer. In this paper we tackle video QA by proposing a Structured Two-stream Attention network (STA) that answers free-form or open-ended natural language questions about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode the text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video content, and focuses on the relevant text. Finally, the structured two-stream fusion component combines the segment-level query-aware and video-aware context representations and infers the answer. Experiments on the large-scale video QA dataset TGIF-QA show that our method significantly surpasses the best counterpart using a single representation for the video input by 13.0%, 13.5%, and 11.0% on the Action, Trans., and FrameQA tasks, and by 0.3 on the Count task. It also outperforms the best competitor using two representations by 4.1%, 4.7%, and 5.1% on the Action, Trans., and FrameQA tasks.
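The core mechanism is the two-stream attention: the question attends over the structured video segments (localizing relevant visual content and down-weighting background segments), while the video attends over the question words (focusing on the relevant text), and the two resulting context representations are fused. The NumPy sketch below illustrates this idea only; the shapes, the attend helper, and the mean-pooling fusion are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys):
    """Scaled dot-product attention: each query vector attends over all keys."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)      # attention distribution over keys
    return weights @ keys                   # (n_queries, d) context vectors

# Illustrative shapes (assumed): 8 video segments, 12 question words, 256-d features.
rng = np.random.default_rng(0)
video_segments = rng.standard_normal((8, 256))   # structured segment features
question_words = rng.standard_normal((12, 256))  # encoded text features

# Stream 1: text attends to video segments, producing query-aware video context.
text_aware_video = attend(question_words, video_segments)

# Stream 2: video attends to question words, producing video-aware text context.
video_aware_text = attend(video_segments, question_words)

# Fusion: pool each stream and concatenate into a joint representation from
# which an answer decoder could score candidate answers.
fused = np.concatenate([text_aware_video.mean(axis=0),
                        video_aware_text.mean(axis=0)])
print(fused.shape)  # (512,)
```

In practice each stream would use learned projections rather than raw dot products, and the fused representation would feed a task-specific answer decoder; those details are left to the full paper.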
