A Universal Quaternion Hypergraph Network for Multimodal Video Question Answering

Fusion and interaction of multimodal features are essential for video question answering. The structural information formed by the relationships among objects in a video is highly complex, which hinders understanding and reasoning. In this paper, we propose a quaternion hypergraph network (QHGN) for multimodal video question answering that jointly models multimodal features and structural information. Since quaternion operations are well suited to multimodal interaction, the four components of a quaternion vector are used to represent the multimodal features. Furthermore, we construct a hypergraph over the visual objects detected in the video. Most importantly, we theoretically derive a quaternion hypergraph convolution operator to perform multimodal and relational reasoning. The question and candidate answers are embedded in quaternion space, and a Q&A reasoning module is designed to select the answer accurately. Moreover, the unified framework can be extended to other video-text tasks with different quaternion decoders. Experimental evaluations on the TVQA and DramaQA datasets show that our method achieves state-of-the-art performance.
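The core quaternion operation behind this style of multimodal interaction is the Hamilton product, which mixes all four components (and hence all four modality channels) of two quaternion feature vectors. The sketch below is illustrative only, not the paper's implementation: the mapping of modalities to components and the shape `(4, d)` are assumptions for the example.

```python
import numpy as np

def hamilton_product(p, q):
    """Element-wise Hamilton product of two quaternion feature vectors.

    p, q: arrays of shape (4, d) holding the real, i, j, k components.
    In a QHGN-style model each component could carry one modality
    (e.g. appearance, motion, subtitle, and question features), so the
    product entangles every pair of modalities.
    """
    r1, x1, y1, z1 = p
    r2, x2, y2, z2 = q
    return np.stack([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,  # real component
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,  # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,  # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,  # k component
    ])

# Sanity check against the quaternion algebra: i * j = k.
i = np.array([[0.0], [1.0], [0.0], [0.0]])
j = np.array([[0.0], [0.0], [1.0], [0.0]])
k = hamilton_product(i, j)
```

Because the product is non-commutative and couples every component with every other, a quaternion-valued layer needs only one quarter of the parameters of a comparable real-valued layer while still modeling cross-modal interactions.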
