Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that features produced by pre-trained extractors such as I3D, which are trained for trimmed-video action classification rather than for the WS-TAL task, inevitably contain task-irrelevant redundancy and lead to sub-optimal performance. Therefore, feature re-calibration is needed to reduce this task-irrelevant redundancy. We propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we introduce two identical cross-modal consensus modules (CCMs), each of which applies a cross-modal attention mechanism to filter out task-irrelevant redundancy using the global information from the main modality and the local information from the auxiliary modality. Moreover, we further exploit inter-modality consistency: the attention weights derived from each CCM serve as pseudo targets for the attention weights derived from the other CCM, keeping the predictions of the two CCMs consistent in a mutual-learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, where our method achieves state-of-the-art results. The experimental results show that the proposed cross-modal consensus module produces more representative features for temporal action localization.
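The cross-modal re-calibration and mutual-learning ideas described above can be sketched in PyTorch as follows. The layer choices, feature dimension, pooling operation, and the form of the consistency loss below are illustrative assumptions for clarity, not the paper's exact implementation.

```python
# Minimal sketch of a cross-modal consensus module (CCM) and the mutual-learning
# consistency between two CCMs, assuming snippet-level RGB/flow features of shape
# (B, T, D). All layer sizes and the loss form are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalConsensusModule(nn.Module):
    """Re-calibrates main-modality features using global context from the main
    modality and local cues from the auxiliary modality."""

    def __init__(self, dim=1024):
        super().__init__()
        self.global_proj = nn.Linear(dim, dim)                 # global descriptor of main modality
        self.local_proj = nn.Conv1d(dim, dim, kernel_size=1)   # local auxiliary-modality cues
        self.gate = nn.Conv1d(dim, dim, kernel_size=1)         # produces attention weights

    def forward(self, main_feat, aux_feat):
        # main_feat, aux_feat: (B, T, D), e.g., RGB as main and flow as auxiliary
        global_ctx = self.global_proj(main_feat.mean(dim=1))        # (B, D) global information
        local_ctx = self.local_proj(aux_feat.transpose(1, 2))       # (B, D, T) local information
        fused = local_ctx + global_ctx.unsqueeze(-1)                # broadcast over time
        attn = torch.sigmoid(self.gate(fused)).transpose(1, 2)      # (B, T, D), values in [0, 1]
        recalibrated = main_feat * attn                             # suppress task-irrelevant redundancy
        return recalibrated, attn


def mutual_learning_loss(attn_rgb, attn_flow):
    """Inter-modality consistency: each CCM's attention is a (detached) pseudo
    target for the other, forming a mutual-learning objective."""
    loss_rgb = F.mse_loss(attn_rgb, attn_flow.detach())
    loss_flow = F.mse_loss(attn_flow, attn_rgb.detach())
    return loss_rgb + loss_flow
```

In this sketch, two such modules are instantiated symmetrically (RGB as main with flow as auxiliary, and vice versa), and the consistency term is added to the usual video-level classification loss; the relative weighting of the two terms is likewise an assumption.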
