Interventional Video Relation Detection

Video Visual Relation Detection (VidVRD) aims to semantically describe the dynamic interactions between visual concepts localized in a video in the form of ⟨subject, predicate, object⟩ triplets. It helps to bridge the semantic gap between vision and language in video understanding, and has thus received increasing attention in the multimedia community. Existing efforts primarily leverage multimodal and spatio-temporal feature fusion to enrich the representations of object trajectories and their interactions, and formulate predicate prediction as a multi-class classification task. Despite their effectiveness, these models ignore the severe long-tailed bias in VidVRD datasets. As a result, their predictions are easily biased towards the popular head predicates (e.g., next-to and in-front-of), leading to poor generalizability. To fill this research gap, this paper proposes an Interventional Video Relation Detection (IVRD) approach that improves not only the accuracy but also the robustness of model predictions. Specifically, to better model high-level visual predicates, IVRD consists of two key components: 1) it learns a set of predicate prototypes, where each prototype vector describes a group of relation references sharing the same predicate; and 2) it applies a causality-inspired intervention to the model input ⟨subject, object⟩, which forces the model to fairly take every possible predicate prototype into consideration. The model is thus encouraged to focus on the visual content of the dynamic interaction between subject and object, rather than on spurious correlations between the model input and predicate labels. Extensive experiments on two popular benchmark datasets demonstrate the effectiveness of IVRD, as well as its advantage in reducing the harmful long-tailed bias.
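The causality-inspired intervention described above can be read as a backdoor adjustment, P(Y | do(X)) = Σ_z P(Y | X, z) P(z): instead of letting the classifier condition only on the (biased) subject-object input, the prediction is averaged over every predicate prototype z, weighted by its prior. The following is a minimal sketch of that averaging step; the additive feature fusion, the toy similarity-based classifier, and all function names are illustrative assumptions, not the paper's actual implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def intervened_prediction(pair_feat, prototypes, prior):
    """Backdoor adjustment: P(Y | do(X)) = sum_z P(Y | X, z) P(z).

    pair_feat  -- fused subject-object feature vector (illustrative)
    prototypes -- one learned prototype vector per predicate class
    prior      -- prior probability P(z) of each prototype (sums to 1)
    """
    num_classes = len(prototypes)
    adjusted = [0.0] * num_classes
    for z, p_z in zip(prototypes, prior):
        # Condition on prototype z via a simple additive fusion
        # (an assumed, illustrative choice of fusion operator).
        fused = [f + zf for f, zf in zip(pair_feat, z)]
        # Toy classifier: score each predicate by the dot-product
        # similarity between the fused feature and its prototype.
        logits = [sum(fd * pd for fd, pd in zip(fused, proto))
                  for proto in prototypes]
        probs = softmax(logits)
        # Accumulate P(Y | X, z) weighted by the prototype prior P(z).
        adjusted = [a + p_z * pr for a, pr in zip(adjusted, probs)]
    return adjusted
```

Because every prototype contributes to the final distribution in proportion to its prior rather than to co-occurrence statistics in the training data, head predicates cannot dominate the prediction purely through dataset bias, which is the intuition behind the intervention.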
