Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNN) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out in-depth comparative analysis to better understand the differences between these approaches and the progress made by them. To this end, we develop an unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for fair comparison. We then conduct an effort towards a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap is made in efficiency for action recognition, but not in accuracy; b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation abilities and transferability. Our codes are available at https://github.com/IBM/action-recognition-pytorch.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[3]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[4]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[6]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[7]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Cordelia Schmid,et al.  Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[14]  Cordelia Schmid,et al.  P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Lior Wolf,et al.  RNN Fisher Vectors for Action Recognition and Image Annotation , 2015, ECCV.

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Shih-Fu Chang,et al.  Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[24]  Tao Mei,et al.  Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[26]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[27]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Yutaka Satoh,et al.  Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[29]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Juergen Gall,et al.  Weakly supervised learning of actions from transcripts , 2016, Comput. Vis. Image Underst..

[31]  Abhinav Gupta,et al.  What Actions are Needed for Understanding Human Actions in Videos? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Luc Van Gool,et al.  UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[35]  Amit K. Roy-Chowdhury,et al.  Collaborative Summarization of Topic-Related Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Nojun Kwak,et al.  Motion Feature Network: Fixed Motion Filter for Action Recognition , 2018, ECCV.

[39]  Wei Zhang,et al.  Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Song Han,et al.  Temporal Shift Module for Efficient Video Understanding , 2018, ArXiv.

[41]  Dahua Lin,et al.  Trajectory Convolution for Action Recognition , 2018, NeurIPS.

[42]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[43]  Luc Van Gool,et al.  Spatio-Temporal Channel Correlation Networks for Action Classification , 2018, ECCV.

[44]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[47]  Kaiming He,et al.  Group Normalization , 2018, ECCV.

[48]  Thomas Brox,et al.  ECO: Efficient Convolutional Network for Online Video Understanding , 2018, ECCV.

[49]  Amit K. Roy-Chowdhury,et al.  W-TALC: Weakly-supervised Temporal Activity Localization and Classification , 2018, ECCV.

[50]  Xiaoyan Sun,et al.  MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Juan Carlos Niebles,et al.  What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Limin Wang,et al.  Appearance-and-Relation Networks for Video Classification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Kaiming He,et al.  Rethinking ImageNet Pre-Training , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[55]  Chuang Gan,et al.  TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[56]  Heng Wang,et al.  Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Arnold W. M. Smeulders,et al.  Timeception for Complex Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Michael S. Ryoo,et al.  Representation Flow for Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[60]  Alan Yuille,et al.  Grouped Spatial-Temporal Aggregation for Efficient Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[61]  Lorenzo Torresani,et al.  DistInit: Learning Video Representations Without a Single Labeled Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[62]  Joanna Materzynska,et al.  The Jester Dataset: A Large-Scale Video Dataset of Human Gestures , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[63]  Andrew Zisserman,et al.  Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Xiao Liu,et al.  StNet: Local and Global Spatial-Temporal Modeling for Action Recognition , 2018, AAAI.

[65]  Chao Li,et al.  Collaborative Spatiotemporal Feature Learning for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[66]  Wei Wu,et al.  STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[67]  Davide Modolo,et al.  Action Recognition With Spatial-Temporal Discriminative Filter Banks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[68]  Heng Wang,et al.  Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[69]  Bolei Zhou,et al.  Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[70]  Bolei Zhou,et al.  Temporal Pyramid Network for Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Juan Carlos Niebles,et al.  RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition , 2020, ECCV.

[72]  Heng Wang,et al.  Video Modeling With Correlation Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Jeremy Kepner,et al.  Accuracy and Performance Comparison of Video Action Recognition Approaches , 2020, 2020 IEEE High Performance Extreme Computing Conference (HPEC).

[74]  Christoph Feichtenhofer,et al.  X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  O. Lanz,et al.  Gate-Shift Networks for Video Action Recognition , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Michael S. Ryoo,et al.  AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures , 2019, ICLR.

[77]  Xinyu Li,et al.  Directional Temporal Modeling for Action Recognition , 2020, ECCV.

[78]  Xudong Jiang,et al.  Temporal Distinct Representation Learning for Action Recognition , 2020, ECCV.

[79]  Feiyue Huang,et al.  TEINet: Towards an Efficient Architecture for Video Recognition , 2019, AAAI.

[80]  Lorenzo Torresani,et al.  Only Time Can Tell: Discovering Temporal Data for Temporal Modeling , 2019, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).