Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

Temporal action localization is crucial for understanding untrimmed videos. In this work, we first identify two underexplored problems that weak supervision poses for temporal action localization: action completeness modeling and action-context separation. We then address both problems explicitly with a novel network architecture and its training strategy. Specifically, to model the completeness of actions, we propose a multi-branch neural network in which the branches are forced to discover distinctive action parts. Complete actions can therefore be localized by fusing the activations from different branches. To separate action instances from their surrounding context, we generate hard negative training data using the prior that motionless video clips are unlikely to be actions. Experiments on the THUMOS'14 and ActivityNet datasets show that our framework outperforms state-of-the-art methods. In particular, the average mAP on ActivityNet v1.2 is significantly improved from 18.0% to 22.4%. Our code will be released soon.
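
The two ideas above can be illustrated with a minimal Python sketch. It is not the released implementation: the abstract does not state the fusion rule or how motion is measured, so averaging the branch activations, the per-clip `motion` score (for example, a mean optical-flow magnitude), and the `ratio` parameter are all assumptions made here for illustration. The function names are hypothetical.

```python
from typing import List, Sequence


def fuse_branches(branch_scores: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-clip activations from several branches.

    Assumption: fusion is a plain average over branches. Each inner
    sequence holds one branch's activation for every clip in the video.
    """
    num_branches = len(branch_scores)
    num_clips = len(branch_scores[0])
    return [
        sum(branch[t] for branch in branch_scores) / num_branches
        for t in range(num_clips)
    ]


def select_hard_negatives(motion: Sequence[float], ratio: float = 0.25) -> List[int]:
    """Pick the clips with the least motion as hard negatives.

    Assumption: `motion` is a per-clip scalar motion score and `ratio`
    is the (hypothetical) fraction of clips to treat as negatives.
    Returns the selected clip indices in ascending order.
    """
    k = max(1, int(len(motion) * ratio))
    by_motion = sorted(range(len(motion)), key=lambda t: motion[t])
    return sorted(by_motion[:k])


if __name__ == "__main__":
    # Two branches that respond to different parts of the same action.
    fused = fuse_branches([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    print(fused)

    # Clips 1 and 3 barely move, so they become hard negatives.
    print(select_hard_negatives([0.9, 0.1, 0.5, 0.05], ratio=0.5))
```

In this toy example each branch fires on a different clip, and only the fused sequence covers both parts of the action. The low-motion clips returned by `select_hard_negatives` would be labeled as background during training.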
