Hierarchical Conditional Relation Networks for Video Question Answering

Video question answering (VideoQA) is challenging because it demands the capacity to distill dynamic visual cues and long-range temporal relations, and to associate them with linguistic concepts. We introduce a general-purpose, reusable neural unit called the Conditional Relation Network (CRN), which serves as a building block for constructing more sophisticated structures for representation and reasoning over video. A CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building then becomes a simple exercise of replicating, rearranging, and stacking these reusable units across diverse modalities and contextual information. This design supports high-order relational and multi-step reasoning. The resulting architecture for VideoQA is a hierarchy of CRNs whose branches represent sub-videos or clips, all sharing the same question as the contextual condition. Evaluations on well-known datasets achieve new state-of-the-art results, demonstrating the impact of building a general-purpose reasoning unit for complex domains such as VideoQA.
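The CRN interface described above can be illustrated with a minimal sketch. The code below is an illustrative simplification, not the paper's implementation: the subset aggregation is a plain mean and the conditioning step is a placeholder `tanh` fusion standing in for the learned sub-networks; the function names and parameters (`crn_unit`, `max_subset_size`, `n_samples`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def crn_unit(objects, condition, max_subset_size=3, n_samples=2):
    """Sketch of a Conditional Relation Network (CRN) unit.

    objects:   list of m feature vectors (the input array of objects)
    condition: a single conditioning feature (e.g. a question embedding)
    Returns an array of encoded output objects, one per sampled subset.
    """
    m = len(objects)
    outputs = []
    # model relations among subsets of size k = 2 .. max_subset_size
    for k in range(2, min(max_subset_size, m - 1) + 1):
        for _ in range(n_samples):
            idx = rng.choice(m, size=k, replace=False)
            subset = np.stack([objects[i] for i in idx])
            relation = subset.mean(axis=0)         # aggregate the subset
            fused = np.tanh(relation + condition)  # condition on the question
                                                   # (placeholder for a learned sub-network)
            outputs.append(fused)
    return outputs

# Toy usage: 5 clip-level features conditioned on one question feature.
clips = [rng.standard_normal(8) for _ in range(5)]
question = rng.standard_normal(8)
encoded = crn_unit(clips, question)
# Stacking: because input and output are both arrays of objects, the
# outputs of one CRN can feed another, sharing the same question —
# this is what makes the hierarchical architecture a simple exercise
# of replication and stacking.
encoded2 = crn_unit(encoded, question)
```

The key design point the sketch captures is the uniform array-in, array-out signature: it lets CRN units be composed into a hierarchy over clips and sub-videos without any interface changes between levels.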
