Deconfounded Video Moment Retrieval with Causal Intervention

We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query. Existing methods primarily model the matching relationship between query and moment through complex cross-modal interactions. Despite their effectiveness, current models mostly exploit dataset biases while ignoring the video content, which leads to poor generalizability. We argue that this issue is caused by a hidden confounder in VMR, i.e., the temporal location of moments, which spuriously correlates the model input and prediction. Designing matching models that are robust against such temporal location biases is crucial but, to the best of our knowledge, has not yet been studied for VMR. To fill this research gap, we propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction. Specifically, we develop a Deconfounded Cross-modal Matching (DCM) method to remove the confounding effects of moment location. It first disentangles the moment representation to infer the core feature of the visual content, and then applies causal intervention to the disentangled multimodal input based on backdoor adjustment, which forces the model to fairly take every possible location of the target moment into consideration. Extensive experiments show that our approach achieves significant improvements over state-of-the-art methods in terms of both accuracy and generalization.
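The backdoor adjustment described above can be made concrete with a small sketch. Since the abstract does not spell out DCM's architecture, the snippet below is a minimal, hypothetical PyTorch illustration of the core idea only: stratify the confounder (the moment's temporal location) into a fixed dictionary, condition the disentangled query-video representation on every stratum, and average the resulting matching scores under a prior P(l), approximating P(Y | do(Q, V)) = Σ_l P(Y | Q, V, l) P(l). All names (BackdoorAdjustedMatcher, location_dict, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class BackdoorAdjustedMatcher(nn.Module):
    """Illustrative backdoor adjustment over stratified moment locations."""

    def __init__(self, feat_dim: int, num_locations: int):
        super().__init__()
        # One embedding per temporal-location stratum l, e.g. quantized
        # normalized (start, end) positions of candidate moments.
        self.location_dict = nn.Embedding(num_locations, feat_dim)
        # Uniform prior P(l); a dataset-estimated prior could be used instead.
        self.register_buffer(
            "location_prior",
            torch.full((num_locations,), 1.0 / num_locations),
        )
        self.scorer = nn.Linear(feat_dim, 1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, feat_dim) disentangled query-video representation,
        # i.e., the location-independent "core" content feature.
        locs = self.location_dict.weight                  # (L, feat_dim)
        # Condition the input on every stratum l: (batch, L, feat_dim).
        conditioned = fused.unsqueeze(1) + locs.unsqueeze(0)
        scores = self.scorer(conditioned).squeeze(-1)     # (batch, L)
        # Expectation over P(l) gives the backdoor-adjusted matching score,
        # approximating P(Y | do(Q, V)) = sum_l P(Y | Q, V, l) P(l).
        return scores @ self.location_prior               # (batch,)
```

Summing over all strata, rather than letting the model condition on the observed (biased) location alone, is what forces every possible target location to contribute to the prediction; in the full method, the per-stratum conditional would be a complete cross-modal matching network rather than a single linear scorer.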
