Modeling Sub-Actions for Weakly Supervised Temporal Action Localization

As a challenging task of high-level video understanding, weakly supervised temporal action localization has attracted more attention recently. Due to the usage of video-level category labels, this task is usually formulated as the task of classification, which always suffers from the contradiction between classification and detection. In this paper, we describe a novel approach to alleviate the contradiction for detecting more complete action instances by explicitly modeling sub-actions. Our method makes use of three innovations to model the latent sub-actions. First, our framework uses prototypes to represent sub-actions, which can be automatically learned in an end-to-end way. Second, we regard the relations among sub-actions as a graph, and construct the correspondences between sub-actions and actions by the graph pooling operation. Doing so not only makes the sub-actions inter-dependent to facilitate the multi-label setting, but also naturally use the video-level labels as weak supervision. Third, we devise three complementary loss functions, namely, representation loss, balance loss and relation loss to ensure the learned sub-actions are diverse and have clear semantic meanings. Experimental results on THUMOS14 and ActivityNet1.3 datasets demonstrate the effectiveness of our method and superior performance over state-of-the-art approaches.

[1]  Dong Xu,et al.  Progressive Cross-Stream Cooperation in Spatial and Temporal Domain for Action Localization , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Bingbing Ni,et al.  Temporal Action Localization with Pyramid of Score Distribution Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Gang Hua,et al.  Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Fei Yin,et al.  Robust Classification with Convolutional Prototype Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Daniel Cremers,et al.  An Improved Algorithm for TV-L 1 Optical Flow , 2009, Statistical and Geometrical Approaches to Visual Motion Analysis.

[7]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  Fei-Fei Li,et al.  Learning latent temporal structure for complex event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[10]  Fei Wu,et al.  Segregated Temporal Assembly Recurrent Networks for Weakly Supervised Multiple Action Detection , 2018, AAAI.

[11]  Amit K. Roy-Chowdhury,et al.  W-TALC: Weakly-supervised Temporal Activity Localization and Classification , 2018, ECCV.

[12]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[14]  Xiu-Shen Wei,et al.  Multi-Label Image Recognition with Joint Class-Aware Map Disentangling and Label Correlation Embedding , 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).

[15]  Mubarak Shah,et al.  Real-World Anomaly Detection in Surveillance Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[17]  Yong Jae Lee,et al.  Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Limin Wang,et al.  Appearance-and-Relation Networks for Video Classification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[20]  Liang Wang,et al.  Two-Branch Relational Prototypical Network for Weakly Supervised Temporal Action Localization. , 2021, IEEE transactions on pattern analysis and machine intelligence.

[21]  Wenyu Liu,et al.  PCL: Proposal Cluster Learning for Weakly Supervised Object Detection , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Fabio Cuzzolin,et al.  Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge , 2016, ArXiv.

[23]  Gong Cheng,et al.  Progressive Contextual Instance Refinement for Weakly Supervised Object Detection in Remote Sensing Images , 2020, IEEE Transactions on Geoscience and Remote Sensing.

[24]  Ali Farhadi,et al.  Unsupervised Deep Embedding for Clustering Analysis , 2015, ICML.

[25]  Ling Shao,et al.  3C-Net: Category Count and Center Loss for Weakly-Supervised Action Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Kaiqi Huang,et al.  Adversarially Occluded Samples for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Xu Ji,et al.  Invariant Information Clustering for Unsupervised Image Classification and Segmentation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Gregory D. Hager,et al.  Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Cordelia Schmid,et al.  Temporal Localization of Actions with Actoms. , 2013, IEEE transactions on pattern analysis and machine intelligence.

[30]  Luc Van Gool,et al.  UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Yong Dou,et al.  Exploring Temporal Preservation Networks for Precise Temporal Action Localization , 2017, AAAI.

[32]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[33]  Xiu-Shen Wei,et al.  Multi-Label Image Recognition With Graph Convolutional Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Yueming Lyu,et al.  Marginalized Average Attentional Network for Weakly-Supervised Learning , 2019, ICLR.

[35]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Gong Cheng,et al.  High-Quality Proposals for Weakly Supervised Object Detection , 2020, IEEE Transactions on Image Processing.

[37]  Shilei Wen,et al.  BMN: Boundary-Matching Network for Temporal Action Proposal Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Lei Zhang,et al.  AutoLoc: Weakly-supervised Temporal Action Localization , 2018, ECCV.

[39]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Matthijs Douze,et al.  Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[41]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[42]  Tong Lu,et al.  Temporal Action Localization by Structured Maximal Sums , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Cordelia Schmid,et al.  Weakly Supervised Action Labeling in Videos under Ordering Constraints , 2014, ECCV.

[44]  Chenliang Xu,et al.  Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[46]  Daochang Liu,et al.  Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Futai Zou,et al.  Adversarial Seeded Sequence Growing for Weakly-Supervised Temporal Action Localization , 2019, ACM Multimedia.

[48]  Gong Cheng,et al.  Automatic Weakly Supervised Object Detection From High Spatial Resolution Remote Sensing Images via Dynamic Curriculum Learning , 2021, IEEE Trans. Geosci. Remote. Sens..

[49]  Bernard Ghanem,et al.  Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization , 2017, ECCV.

[50]  Jean Ponce,et al.  Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[51]  Mubarak Shah,et al.  Real-Time Temporal Action Localization in Untrimmed Videos by Sub-Action Discovery , 2017, BMVC.

[52]  Limin Wang,et al.  Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53]  Charless C. Fowlkes,et al.  Weakly-Supervised Action Localization With Background Modeling , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Silvio Savarese,et al.  Action Recognition by Hierarchical Mid-Level Action Elements , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[55]  Jianping Fan,et al.  Correlative multi-label multi-instance image annotation , 2011, 2011 International Conference on Computer Vision.

[56]  Ashraful Islam,et al.  Weakly Supervised Temporal Action Localization Using Deep Metric Learning , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[57]  Rahul Sukthankar,et al.  Rethinking the Faster R-CNN Architecture for Temporal Action Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Jing Xu,et al.  Attention-Aware Compositional Network for Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Shih-Fu Chang,et al.  Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[61]  Yadong Mu,et al.  Weakly-Supervised Action Localization by Generative Attention Modeling , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Larry S. Davis,et al.  Temporal Context Network for Activity Localization in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[63]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[64]  Youngjung Uh,et al.  Background Suppression Network for Weakly-supervised Temporal Action Localization , 2020, ArXiv.

[65]  Juan Carlos Niebles,et al.  Connectionist Temporal Modeling for Weakly Supervised Action Labeling , 2016, ECCV.

[66]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[67]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[68]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[69]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[70]  Tao Mei,et al.  Gaussian Temporal Awareness Networks for Action Localization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Tinne Tuytelaars,et al.  Weakly supervised object detection with convex clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Shih-Fu Chang,et al.  CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Shuicheng Yan,et al.  Graph-Based Global Reasoning Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[75]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Sarah Parisot,et al.  Learning Conditioned Graph Structures for Interpretable Visual Question Answering , 2018, NeurIPS.

[77]  Forrest N. Iandola,et al.  Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[78]  Chang Liu,et al.  C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Xu Zhao,et al.  Single Shot Temporal Action Detection , 2017, ACM Multimedia.

[80]  Tao Zhang,et al.  Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector , 2018, ACM Multimedia.

[81]  Jure Leskovec,et al.  Hierarchical Graph Representation Learning with Differentiable Pooling , 2018, NeurIPS.

[82]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[83]  Bohyung Han,et al.  Weakly Supervised Action Localization by Sparse Temporal Pooling Network , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[84]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[85]  Ming Yang,et al.  BSN: Boundary Sensitive Network for Temporal Action Proposal Generation , 2018, ECCV.

[86]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[87]  Yan Huang,et al.  Relational Prototypical Network for Weakly Supervised Temporal Action Localization , 2020, AAAI.

[88]  Ramakant Nevatia,et al.  Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images , 2015, ACM Multimedia.

[89]  Gang Hua,et al.  Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization , 2020, ECCV.

[90]  Wei Zhang,et al.  Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[91]  Kyle Min,et al.  Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization , 2020, ECCV.

[92]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[93]  Li Fei-Fei,et al.  End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).