Cascaded Pyramid Mining Network for Weakly Supervised Temporal Action Localization

Weakly supervised temporal action localization, which aims at temporally locating action instances in untrimmed videos using only video-level class labels during training, is an important yet challenging problem in video analysis. Many current methods adopt the “localization by classification” framework: first do video classification, then locate temporal area contributing to the results most. However, this framework fails to locate the entire action instances and gives little consideration to the local context. In this paper, we present a novel architecture called Cascaded Pyramid Mining Network (CPMN) to address these issues using two effective modules. First, to discover the entire temporal interval of specific action, we design a two-stage cascaded module with proposed Online Adversarial Erasing (OAE) mechanism, where new and complementary regions are mined through feeding the erased feature maps of discovered regions back to the system. Second, to exploit hierarchical contextual information in videos and reduce missing detections, we design a pyramid module which produces a scale-invariant attention map through combining the feature maps from different levels. Final, we aggregate the results of two modules to perform action localization via locating high score areas in temporal Class Activation Sequence (CAS). Extensive experiments conducted on THUMOS14 and ActivityNet-1.3 datasets demonstrate the effectiveness of our method.

[1]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Wenyu Liu,et al.  Multiple Instance Detection Network with Online Instance Classifier Refinement , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Shih-Fu Chang,et al.  CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[6]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Xu Zhao,et al.  Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017 , 2017, ArXiv.

[8]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[9]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[10]  Bernard Ghanem,et al.  The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary , 2018, ArXiv.

[11]  Juergen Gall,et al.  Temporal Action Detection Using a Statistical Language Model , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Ming Yang,et al.  BSN: Boundary Sensitive Network for Temporal Action Proposal Generation , 2018, ECCV.

[13]  Andrea Vedaldi,et al.  Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[15]  Cordelia Schmid,et al.  The LEAR submission at Thumos 2014 , 2014 .

[16]  Luc Van Gool,et al.  UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[19]  Xu Zhao,et al.  Weakly Supervised Temporal Action Detection with Shot-Based Temporal Pooling Network , 2018, ICONIP.

[20]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Ramakant Nevatia,et al.  Cascaded Boundary Regression for Temporal Action Detection , 2017, BMVC.

[22]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Bernard Ghanem,et al.  SCC: Semantic Context Cascade for Efficient Action Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[26]  Shih-Fu Chang,et al.  Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Tong Lu,et al.  Temporal Action Localization by Structured Maximal Sums , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Xu Zhao,et al.  Single Shot Temporal Action Detection , 2017, ACM Multimedia.

[30]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[32]  Limin Wang,et al.  Temporal Action Detection with Structured Segment Networks , 2017, International Journal of Computer Vision.

[33]  Yao Zhao,et al.  Object Region Mining with Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[37]  Limin Wang,et al.  A Pursuit of Temporal Accuracy in General Activity Detection , 2017, ArXiv.

[38]  Fabio Cuzzolin,et al.  Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge , 2016, ArXiv.

[39]  Yong Jae Lee,et al.  Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[41]  Rahul Sukthankar,et al.  Rethinking the Faster R-CNN Architecture for Temporal Action Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.