Redundancy-aware Transformer for Video Question Answering

This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically, current video encoders tend to holistically embed all video clues at different granularities in a hierarchical manner, which inevitably introduces neighboring-frame redundancy that can overwhelm detailed visual clues at the object level. Subsequently, prevailing vision-language fusion designs introduce cross-modal redundancy by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, which harms answering. To this end, we propose a novel transformer-based architecture that models VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder that emphasizes object-level changes between neighboring frames while adopting an out-of-neighboring message-passing scheme that imposes attention only on distant frames. As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling mechanism that explicitly differentiates the vision-language interactions by identifying a small subset of visual elements that exclusively support the answer. With these advancements, the proposed Redundancy-aware transformer (RaFormer) achieves state-of-the-art results on multiple VideoQA benchmarks.
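To make the two ideas concrete, below is a minimal PyTorch sketch of (a) an out-of-neighboring attention mask that blocks message passing between temporally adjacent frames, and (b) a simple question-conditioned selection of a small subset of visual elements. All names, the neighborhood radius `radius`, the sample size `k`, and the hard top-k selection are illustrative assumptions for exposition, not the paper's actual implementation (which may use a learned, differentiable sampler).

```python
# Minimal sketch of the two redundancy-aware ideas described in the abstract.
# Function names, `radius`, `k`, and the hard top-k choice are assumptions.
import torch
import torch.nn.functional as F


def out_of_neighbor_mask(num_frames: int, radius: int = 2) -> torch.Tensor:
    """Boolean mask allowing attention only between frames more than `radius`
    apart, suppressing redundant neighboring-frame interactions."""
    idx = torch.arange(num_frames)
    dist = (idx[:, None] - idx[None, :]).abs()
    mask = dist > radius                               # True = attend
    mask |= torch.eye(num_frames, dtype=torch.bool)    # keep self-attention
    return mask


def masked_frame_attention(frames: torch.Tensor, radius: int = 2) -> torch.Tensor:
    """Single-head self-attention over frame features (T, D), restricted to
    out-of-neighborhood positions via the mask above."""
    T, D = frames.shape
    scores = frames @ frames.t() / D ** 0.5
    scores = scores.masked_fill(~out_of_neighbor_mask(T, radius), float("-inf"))
    return F.softmax(scores, dim=-1) @ frames


def adaptive_sample(visual: torch.Tensor, question: torch.Tensor, k: int = 8):
    """Score visual elements (N, D) against a pooled question embedding (L, D)
    and keep only the k most relevant elements for cross-modal fusion."""
    q = question.mean(dim=0)                           # (D,) pooled question
    relevance = visual @ q                             # (N,) cross-modal scores
    topk = relevance.topk(k=min(k, visual.size(0))).indices
    return visual[topk], topk
```

In this reading, the mask realizes the out-of-neighboring message-passing scheme by forcing attention onto distant frames only, and `adaptive_sample` stands in for the adaptive sampling that differentiates vision-language interactions by retaining a small, answer-supporting subset of visual elements before fusion.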
