Benchmarking Data Efficiency and Computational Efficiency of Temporal Action Localization Models

In temporal action localization, the goal is to predict, given an input video, which actions it contains and where each action begins and ends. Training and testing current state-of-the-art deep learning models requires large amounts of data and computational power, yet gathering such data is challenging and computational resources are often limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set and find that TemporalMaxer outperforms the other models in data-limited settings; when training time is limited, we instead recommend TriDet. To test the efficiency of the models during inference, we pass videos of different lengths through each model and find that TemporalMaxer requires the least computational resources, likely due to its simple architecture.
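
The two measurement protocols are straightforward to implement. The sketch below is a minimal illustration, not code from this work: the helper names, the assumption that a model consumes a pre-extracted feature tensor of shape (batch, time, channels), and the placeholder feature dimension of 2304 are assumptions made for the example. It shows one way to draw a reproducible training subset for data-limited runs and to record per-clip latency and peak GPU memory for feature sequences of different lengths.

```python
import random
import time

import torch


def sample_training_subset(video_ids, fraction, seed=0):
    """Draw a reproducible random subset of training videos for a data-limited run."""
    rng = random.Random(seed)
    k = max(1, int(len(video_ids) * fraction))
    return rng.sample(list(video_ids), k)


@torch.no_grad()
def measure_inference(model, num_frames, feat_dim=2304, device="cuda", warmup=3, repeats=10):
    """Time one forward pass and record peak GPU memory for a clip of `num_frames` feature vectors."""
    model = model.to(device).eval()
    # Hypothetical input: one clip of pre-extracted features, shape (batch, time, channels).
    feats = torch.randn(1, num_frames, feat_dim, device=device)
    for _ in range(warmup):  # warm-up passes exclude one-off CUDA setup costs from the timing
        model(feats)
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)
    start = time.perf_counter()
    for _ in range(repeats):
        model(feats)
    torch.cuda.synchronize(device)
    latency_ms = (time.perf_counter() - start) / repeats * 1e3
    peak_mem_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return latency_ms, peak_mem_mb
```

Sweeping `num_frames` over the clip lengths of interest then yields latency and memory curves of the kind used to compare models at inference time.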

[1] Tuan N. Tang, et al. TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization, 2023, arXiv.

[2] Yujie Zhong, et al. TriDet: Temporal Action Detection with Relative Boundary Modeling, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Xiatian Zhu, et al. Zero-Shot Temporal Action Detection via Vision-Language Prompting, 2022, ECCV.

[4] Yin Li, et al. ActionFormer: Localizing Moments of Actions with Transformers, 2022, ECCV.

[5] Tao Xiang, et al. Few-Shot Temporal Action Localization with Query Adaptive Transformer, 2021, BMVC.

[6] Tongliang Liu, et al. KFC: An Efficient Framework for Semi-Supervised Temporal Action Localization, 2021, IEEE Transactions on Image Processing.

[7] Shiwei Zhang, et al. End-to-End Temporal Action Detection With Transformer, 2021, IEEE Transactions on Image Processing.

[8] Enhua Wu, et al. Transformer in Transformer, 2021, NeurIPS.

[9] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[10] Bernard Ghanem, et al. TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[11] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[12] Yi Tay, et al. Efficient Transformers: A Survey, 2020, ACM Comput. Surv.

[13] Cees G. M. Snoek, et al. Localizing the Common Action Among a Few Videos, 2020, ECCV.

[14] K. Keutzer, et al. Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers, 2020, ICML.

[15] D. Damen, et al. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100, 2020, International Journal of Computer Vision.

[16] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.

[17] Runhao Zeng, et al. Graph Convolutional Networks for Temporal Action Localization, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18] Yadong Mu, et al. Scale Matters: Temporal Scale Aggregation Network For Precise Action Localization In Untrimmed Videos, 2020 IEEE International Conference on Multimedia and Expo (ICME).

[19] Shilei Wen, et al. BMN: Boundary-Matching Network for Temporal Action Proposal Generation, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20] Ali Razavi, et al. Data-Efficient Image Recognition with Contrastive Predictive Coding, 2019, ICML.

[21] David Berthelot, et al. MixMatch: A Holistic Approach to Semi-Supervised Learning, 2019, NeurIPS.

[22] Ming Yang, et al. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation, 2018, ECCV.

[23] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[24] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Luc Van Gool, et al. UntrimmedNets for Weakly Supervised Action Recognition and Detection, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Haroon Idrees, et al. The THUMOS challenge on action recognition for videos "in the wild", 2016, Comput. Vis. Image Underst.

[27] Li Fei-Fei, et al. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos, 2015, International Journal of Computer Vision.

[28] Bernard Ghanem, et al. ActivityNet: A large-scale video benchmark for human activity understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Anupam Agrawal, et al. A survey on activity recognition and behavior understanding in video surveillance, 2013, The Visual Computer.

[30] Yong Jae Lee, et al. Discovering important people and objects for egocentric video summarization, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31] Yongzhao Zhan, et al. A Survey on Temporal Action Localization, 2020, IEEE Access.

[32] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.