Self-supervised Pretraining with Classification Labels for Temporal Activity Detection

Temporal Activity Detection aims to predict activity classes per frame, in contrast to the video-level predictions made in Activity Classification (i.e., Activity Recognition). Because detection requires expensive frame-level annotations, detection datasets are limited in scale. Previous work on temporal activity detection therefore commonly resorts to fine-tuning a classification model pretrained on a large-scale classification dataset (e.g., Kinetics-400). However, such pretrained models are suboptimal for downstream detection due to the disparity between the pretraining task and the downstream fine-tuning task. This work proposes a novel self-supervised pretraining method for detection that leverages classification labels to mitigate this disparity by introducing frame-level pseudo labels, multi-action frames, and action segments. We show that models pretrained with the proposed self-supervised detection task outperform prior work on multiple challenging activity detection benchmarks, including Charades and MultiTHUMOS. Our extensive ablations further provide insights on when and how to use the proposed models for activity detection. Code and models will be released online.
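
As a rough illustration of the data construction described above, the sketch below stitches single-label classification clips into one longer sequence with frame-level (multi-hot) pseudo labels, optionally blending two segments to imitate multi-action frames. This is a minimal sketch under our own assumptions, not the paper's actual pipeline: the function name build_pseudo_detection_sample, its parameters (seg_lens, mix_prob), and the 50/50 blending strategy are hypothetical.

```python
import numpy as np


def build_pseudo_detection_sample(clips, labels, num_classes, seg_lens,
                                  mix_prob=0.0, rng=None):
    """Hypothetical sketch: stitch single-label classification clips into a
    longer sequence with per-frame multi-hot pseudo labels.

    clips    : list of arrays, each (T_i, H, W, C), frames of a classification video
    labels   : list of int class ids, one video-level label per clip
    seg_lens : frames to take from each clip (the "action segments")
    mix_prob : chance of blending two segments to simulate multi-action frames
    """
    rng = rng or np.random.default_rng()
    frames, frame_labels = [], []
    for clip, label, seg_len in zip(clips, labels, seg_lens):
        # Crop a random temporal segment from the classification clip.
        start = rng.integers(0, max(1, clip.shape[0] - seg_len + 1))
        seg = clip[start:start + seg_len]
        # Broadcast the video-level label to every frame of the segment.
        tgt = np.zeros((seg.shape[0], num_classes), dtype=np.float32)
        tgt[:, label] = 1.0
        frames.append(seg)
        frame_labels.append(tgt)

    # Optionally overlay two segments (assumed same spatial size) so that some
    # frames carry more than one action label.
    if len(frames) >= 2 and rng.random() < mix_prob:
        n = min(len(frames[0]), len(frames[1]))
        frames[0][:n] = 0.5 * frames[0][:n] + 0.5 * frames[1][:n]
        frame_labels[0][:n] = np.maximum(frame_labels[0][:n], frame_labels[1][:n])

    return np.concatenate(frames, axis=0), np.concatenate(frame_labels, axis=0)


if __name__ == "__main__":
    # Example: two 64-frame clips with labels 3 and 7; 157 classes is assumed
    # (the Charades label space), purely for illustration.
    clips = [np.random.rand(64, 112, 112, 3).astype(np.float32) for _ in range(2)]
    frames, targets = build_pseudo_detection_sample(clips, [3, 7], 157,
                                                    [32, 32], mix_prob=0.5)
    print(frames.shape, targets.shape)  # (64, 112, 112, 3) (64, 157)
```

A pretraining loop could then feed the stitched frames to a detection backbone and train with a per-frame binary cross-entropy loss against the multi-hot pseudo labels, before fine-tuning on the real detection benchmark.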
