论文信息 - Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. However, without frame-level annotations, it is challenging for W-TAL methods to identify false positive action proposals and generate action proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated, and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries. Experiments conducted on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods, and even achieves comparable results with some recent fully-supervised methods.

[1] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2] Ling Shao,et al. 3C-Net: Category Count and Center Loss for Weakly-Supervised Action Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3] Bill Triggs,et al. Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[4] Xudong Lin,et al. DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Gang Hua,et al. Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6] Shilei Wen,et al. BMN: Boundary-Matching Network for Temporal Action Proposal Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[7] Daochang Liu,et al. Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Youngjung Uh,et al. Background Suppression Network for Weakly-supervised Temporal Action Localization , 2020, ArXiv.

[9] Charless C. Fowlkes,et al. Weakly-Supervised Action Localization With Background Modeling , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10] Max Welling,et al. Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[11] Lei Zhang,et al. AutoLoc: Weakly-supervised Temporal Action Localization , 2018, ECCV.

[12] Tong Lu,et al. Temporal Action Localization by Structured Maximal Sums , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Ivan Laptev,et al. On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[14] Cordelia Schmid,et al. Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[15] Ming Yang,et al. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation , 2018, ECCV.

[16] Amit K. Roy-Chowdhury,et al. W-TALC: Weakly-supervised Temporal Activity Localization and Classification , 2018, ECCV.

[17] Ning Xu,et al. Temporal Structure Mining for Weakly Supervised Action Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18] Larry S. Davis,et al. Temporal Context Network for Activity Localization in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Rahul Sukthankar,et al. Rethinking the Faster R-CNN Architecture for Temporal Action Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21] Bohyung Han,et al. Weakly Supervised Action Localization by Sparse Temporal Pooling Network , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22] Bernard Ghanem,et al. RefineLoc: Iterative Refinement for Weakly-Supervised Action Localization , 2019, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[23] Cordelia Schmid,et al. MARS: Motion-Augmented RGB Stream for Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Cordelia Schmid,et al. Action recognition by dense trajectories , 2011, CVPR 2011.

[25] Luc Van Gool,et al. UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27] R. Nevatia,et al. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[29] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[30] Yong Jae Lee,et al. Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31] Shih-Fu Chang,et al. CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Du Q. Huynh,et al. Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition With CNNs , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34] Tao Mei,et al. Gaussian Temporal Awareness Networks for Action Localization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Runhao Zeng,et al. Graph Convolutional Networks for Temporal Action Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36] Limin Wang,et al. Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[38] Junbin Gao,et al. Learning graph structure for multi-label image classification via clique generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Kate Saenko,et al. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[41] Michael S. Ryoo,et al. Representation Flow for Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Ramakant Nevatia,et al. Cascaded Boundary Regression for Temporal Action Detection , 2017, BMVC.

[43] Nanning Zheng,et al. Action Coherence Network for Weakly Supervised Temporal Action Localization , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[44] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Dahua Lin,et al. Recognize Actions by Disentangling Components of Dynamics , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46] Bolei Zhou,et al. Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Bernard Ghanem,et al. SCC: Semantic Context Cascade for Efficient Action Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).