ACGNet: Action Complement Graph Network for Weakly-supervised Temporal Action Localization

Weakly-supervised temporal action localization (WTAL) in untrimmed videos has emerged as a practical but challenging task, since only video-level labels are available. Existing approaches typically rely on off-the-shelf segment-level features, which suffer from spatial incompleteness and temporal incoherence and therefore limit performance. In this paper, we tackle the problem from a new perspective by enhancing segment-level representations with a simple yet effective graph convolutional network, the action complement graph network (ACGNet). It enables each video segment to perceive spatial-temporal dependencies from other segments that potentially convey complementary clues, implicitly mitigating the negative effects of the two issues above. As a result, the segment-level features become more discriminative and more robust to spatial-temporal variations, which contributes to higher localization accuracy. More importantly, the proposed ACGNet works as a universal module that can be flexibly plugged into different WTAL frameworks while preserving end-to-end training. Extensive experiments are conducted on the THUMOS'14 and ActivityNet1.2 benchmarks, where the state-of-the-art results demonstrate the superiority of the proposed approach.
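The idea of letting each segment draw complementary information from related segments can be illustrated with a minimal sketch. The abstract does not specify how the graph is built or how features are aggregated, so everything below is an assumption made for illustration: edges come from cosine similarity, each segment keeps its `top_k` most similar neighbors, aggregation is a similarity-weighted average with no learned weights, and the result is added back to the original feature. The names `cosine` and `complement_features` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def complement_features(features, top_k=2):
    """Enhance each segment feature with a weighted average of its neighbors.

    features: list of T segment-level feature vectors (lists of floats).
    top_k:    number of most similar segments kept as graph neighbors
              (an assumed sparsification rule, not the paper's).
    """
    n = len(features)
    enhanced = []
    for i in range(n):
        # Edge weights: non-negative cosine similarity to every other segment.
        sims = [(max(cosine(features[i], features[j]), 0.0), j)
                for j in range(n) if j != i]
        neighbors = sorted(sims, reverse=True)[:top_k]
        total = sum(s for s, _ in neighbors)

        # Similarity-weighted average of neighbor features (one propagation
        # step, without the learned projection a real GCN layer would have).
        aggregated = [0.0] * len(features[i])
        if total > 0:
            for s, j in neighbors:
                weight = s / total
                for d, value in enumerate(features[j]):
                    aggregated[d] += weight * value

        # Residual connection: original feature plus complementary information.
        enhanced.append([x + a for x, a in zip(features[i], aggregated)])
    return enhanced


# Three toy segments: the first two are similar, the third is unrelated.
segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(complement_features(segments, top_k=1))
```

Because the enhanced features have the same shape as the input, a module of this kind could sit between a feature extractor and any downstream localization head, which is the plug-in property the abstract describes.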
