Graph-Based Multi-Interaction Network for Video Question Answering

Video question answering is an important task combining both Natural Language Processing and Computer Vision, which requires a machine to obtain a thorough understanding of the video. Most existing approaches simply capture spatio-temporal information in videos by using a combination of recurrent and convolutional neural networks. Nonetheless, most previous work focus on only salient frames or regions, which normally lacks some significant details, such as potential location and action relations. In this paper, we propose a new method called Graph-based Multi-interaction Network for video question answering. In our model, a new attention mechanism named multi-interaction is designed to capture both element-wise and segment-wise sequence interactions simultaneously, which can be found between and inside the multi-modal inputs. Moreover, we propose a graph-based relation-aware neural network to explore a more fine-grained visual representation, which could explore the relationships and dependencies between objects spatially and temporally. We evaluate our method on TGIF-QA and other two video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves state-of-the-art performance.

[1]  Yale Song,et al.  TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yueting Zhuang,et al.  Video Question Answering via Gradually Refined Attention over Appearance and Motion , 2017, ACM Multimedia.

[3]  Mario Fritz,et al.  Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Joshua B. Tenenbaum,et al.  Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense , 2020, Engineering.

[5]  Tao Mei,et al.  Multi-level Attention Networks for Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Anton van den Hengel,et al.  Graph-Structured Representations for Visual Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Ramakant Nevatia,et al.  Motion-Appearance Co-memory Networks for Video Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Mario Fritz,et al.  A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.

[9]  Juan Carlos Niebles,et al.  Leveraging Video Descriptions to Learn Video Question Answering , 2016, AAAI.

[10]  Zhou Zhao,et al.  Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks , 2020, IEEE Transactions on Image Processing.

[11]  Jun Xiao,et al.  Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network , 2018, IJCAI.

[12]  Nicholas Jing Yuan,et al.  Multi-Turn Video Question Generation via Reinforced Multi-Choice Attention Network , 2021, IEEE Transactions on Circuits and Systems for Video Technology.

[13]  Khalil Sima'an,et al.  Graph Convolutional Encoders for Syntax-aware Neural Machine Translation , 2017, EMNLP.

[14]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Joan Bruna,et al.  Few-Shot Learning with Graph Neural Networks , 2017, ICLR.

[17]  Yichen Wei,et al.  Learning Region Features for Object Detection , 2018, ECCV.

[18]  Yueting Zhuang,et al.  Temporality-enhanced knowledgememory network for factoid question answering , 2018, Frontiers of Information Technology & Electronic Engineering.

[19]  Yueting Zhuang,et al.  Disambiguating named entities with deep supervised learning via crowd labels , 2017, Frontiers of Information Technology & Electronic Engineering.

[20]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[21]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[22]  Richard S. Zemel,et al.  Exploring Models and Data for Image Question Answering , 2015, NIPS.

[23]  Xuelong Li,et al.  The Next Breakthroughs of Artificial Intelligence: The Interdisciplinary Nature of AI , 2020 .

[24]  Zijian Zhang,et al.  Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction , 2020, IEEE Transactions on Image Processing.

[25]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[26]  Abhinav Gupta,et al.  Videos as Space-Time Region Graphs , 2018, ECCV.

[27]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Meng Wang,et al.  Tri-Clustered Tensor Completion for Social-Aware Image Tag Refinement , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Jason Weston,et al.  End-To-End Memory Networks , 2015, NIPS.

[30]  Yuxin Peng,et al.  Object-Aware Aggregation With Bidirectional Temporal Graph for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jinhui Tang,et al.  Weakly-Shared Deep Transfer Networks for Heterogeneous-Domain Knowledge Propagation , 2015, ACM Multimedia.

[32]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Jun Yu,et al.  Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks , 2019, IEEE Transactions on Image Processing.

[34]  Deng Cai,et al.  Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks , 2019, IEEE Transactions on Image Processing.

[35]  Zhou Zhao,et al.  Multi-interaction Network with Object Relation for Video Question Answering , 2019, ACM Multimedia.

[36]  Judea Pearl,et al.  Causal Inference , 2010 .

[37]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Asim Kadav,et al.  Attend and Interact: Higher-Order Object Interactions for Video Understanding , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Richard Socher,et al.  Ask Me Anything: Dynamic Memory Networks for Natural Language Processing , 2015, ICML.

[40]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[41]  Abhinav Gupta,et al.  Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  Yahong Han,et al.  Explore Multi-Step Reasoning in Video Question Answering , 2018, CoVieW@MM.

[43]  Richard Socher,et al.  Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.

[44]  Yi Yang,et al.  Uncovering the Temporal Context for Video Question Answering , 2017, International Journal of Computer Vision.

[45]  Yueting Zhuang,et al.  Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network , 2019, ACM Trans. Multim. Comput. Commun. Appl..

[46]  Tao Mei,et al.  Structured Two-Stream Attention Network for Video Question Answering , 2019, AAAI.

[47]  Xinlei Chen,et al.  Grounded Video Description , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Serge J. Belongie,et al.  Object categorization using co-occurrence, location and appearance , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Dacheng Tao,et al.  Deep Multimodal Neural Architecture Search , 2020, ACM Multimedia.

[51]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[52]  Jun Yu,et al.  ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering , 2019, AAAI.

[53]  Yue-Guang Lyu Artificial Intelligence: Enabling Technology to Empower Society , 2020 .

[54]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[55]  Ali Farhadi,et al.  Learning Everything about Anything: Webly-Supervised Visual Concept Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Runhao Zeng,et al.  Graph Convolutional Networks for Temporal Action Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57]  Zhou Yu,et al.  Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[58]  Zhou Zhao,et al.  Video Dialog via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks , 2020, IEEE Transactions on Circuits and Systems for Video Technology.

[59]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[60]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[61]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Chun Chen,et al.  Challenges and opportunities: from big data to knowledge in AI 2.0 , 2017, Frontiers of Information Technology & Electronic Engineering.

[63]  Zhou Yu,et al.  Deep Modular Co-Attention Networks for Visual Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[65]  Yichen Wei,et al.  Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[67]  Yueting Zhuang,et al.  Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks , 2019, SIGIR.

[68]  Yunhe Pan,et al.  Multiple Knowledge Representation of Artificial Intelligence , 2020 .

[69]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[70]  Jongwook Choi,et al.  End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Zhou Yu,et al.  Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks , 2018, IJCAI.

[72]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Furu Wei,et al.  Video Dialog via Progressive Inference and Cross-Transformer , 2019, EMNLP/IJCNLP.

[74]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[75]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Gunhee Kim,et al.  A Read-Write Memory Network for Movie Story Understanding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[77]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.