Frame Augmented Alternating Attention Network for Video Question Answering

Vision-and-language understanding is one of the most fundamental and challenging problems in Multimedia Intelligence. Simultaneously understanding video actions and a related natural-language question, and then producing an accurate answer, is even more challenging because it requires jointly modeling information across modalities. In the past few years, several studies have attacked this problem with attention-enhanced deep neural networks. However, simple attention mechanisms such as unidirectional attention fail to yield a good mapping between the two modalities. Moreover, none of these video QA models exploits high-level semantics at an augmented video-frame level. In this paper, we augment each frame representation with its context information via a novel feature extractor that combines the advantages of ResNet and a variant of C3D. In addition, we propose a novel alternating attention network that alternately attends to frame regions, video frames, and question words over multiple turns. This yields better joint representations of the video and the question, helping the model discover deeper relationships between the two modalities. Our method outperforms state-of-the-art video QA models on two existing video question answering datasets, and ablation studies show that the feature extractor and the alternating attention mechanism jointly improve performance.
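To make the two components concrete, below is a minimal, hypothetical PyTorch sketch of how they could fit together: per-frame appearance features (e.g., from a ResNet) are fused with clip-level context features (e.g., from a C3D-style 3D CNN), and a shared query then alternates between attending to frames and to question words. This is not the authors' implementation; all dimensions, layer choices, and the number of turns are illustrative assumptions, and the region-level attention mentioned in the abstract is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameAugmenter(nn.Module):
    """Fuse per-frame appearance features with clip-level context features
    (assumed dimensions: 2048 for ResNet, 4096 for a C3D-style network)."""
    def __init__(self, appearance_dim=2048, context_dim=4096, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(appearance_dim + context_dim, hidden_dim)

    def forward(self, appearance, context):
        # appearance: (batch, num_frames, appearance_dim)
        # context:    (batch, num_frames, context_dim), e.g., features of the
        #             short clip centered on each frame
        fused = torch.cat([appearance, context], dim=-1)
        return torch.tanh(self.proj(fused))  # (batch, num_frames, hidden_dim)

class AlternatingAttention(nn.Module):
    """Alternately attend to video frames and question words for a fixed
    number of turns, refining a shared query vector each turn."""
    def __init__(self, hidden_dim=512, num_turns=2):
        super().__init__()
        self.num_turns = num_turns
        self.frame_score = nn.Linear(hidden_dim, 1)
        self.word_score = nn.Linear(hidden_dim, 1)
        self.frame_query = nn.Linear(hidden_dim, hidden_dim)
        self.word_query = nn.Linear(hidden_dim, hidden_dim)

    def attend(self, features, query, score_layer, query_layer):
        # features: (batch, seq_len, hidden_dim); query: (batch, hidden_dim)
        scores = score_layer(torch.tanh(features + query_layer(query).unsqueeze(1)))
        weights = F.softmax(scores, dim=1)      # attention over seq_len
        return (weights * features).sum(dim=1)  # (batch, hidden_dim)

    def forward(self, frames, words):
        # frames: (batch, num_frames, hidden_dim)
        # words:  (batch, num_words, hidden_dim)
        query = words.mean(dim=1)  # initialize the query from the question
        for _ in range(self.num_turns):
            v = self.attend(frames, query, self.frame_score, self.frame_query)
            q = self.attend(words, v, self.word_score, self.word_query)
            query = q
        return v, q  # attended video and question representations

# Example usage with random tensors standing in for extracted features:
augmenter = FrameAugmenter()
attention = AlternatingAttention()
frames = augmenter(torch.randn(2, 20, 2048), torch.randn(2, 20, 4096))
video_repr, question_repr = attention(frames, torch.randn(2, 12, 512))
```

The returned video and question representations can then be combined (e.g., concatenated or element-wise multiplied) and fed to an answer classifier; that fusion step is left out here because the abstract does not specify it.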
