Temporal Transformer Networks With Self-Supervision for Action Recognition

In recent years, video action recognition based on 2D convolutional networks has gained wide popularity. However, existing models lack long-range non-linear temporal relation modeling and reverse motion information modeling, which seriously limits their performance. To address this problem, we introduce a Temporal Transformer Network with Self-supervision (TTSN). TTSN consists of two main components: a temporal transformer module and a temporal sequence self-supervision module. The temporal transformer module models non-linear temporal dependencies among non-local frames, which enhances the representation of complex motion features. The temporal sequence self-supervision module adopts a streamlined "random batch random channel" strategy to reverse the order of video frames, enabling robust extraction of motion representations along the inverted temporal dimension and improving the generalization ability of the model. Extensive experiments on three widely used datasets (HMDB51, UCF101, and Something-Something V1) demonstrate that the proposed TTSN achieves state-of-the-art performance for action recognition.
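To make the two ideas above concrete, the sketch below shows, in PyTorch, (a) self-attention applied along the frame axis so that temporally distant frames can interact directly, and (b) a simple frame-order-reversal pretext task in which randomly chosen clips in a batch are reversed and a binary label records the playback direction. This is a minimal illustrative sketch, not the authors' implementation; the module and function names (`TemporalSelfAttention`, `reverse_random_samples`) and the exact form of the "random batch random channel" strategy are assumptions based only on the abstract.

```python
# Minimal sketch of temporal self-attention and a frame-reversal pretext task.
# Names and the reversal strategy are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class TemporalSelfAttention(nn.Module):
    """Self-attention over the frame (time) axis of per-frame features."""

    def __init__(self, embed_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, embed_dim) -- one feature vector per frame.
        # Every frame attends to every other frame, so dependencies between
        # non-local (temporally distant) frames are modeled directly.
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)  # residual connection


def reverse_random_samples(x: torch.Tensor, p: float = 0.5):
    """Randomly reverse the frame order of some clips in the batch.

    Returns the (possibly reversed) clips and a 0/1 label per clip
    (1 = reversed) that an auxiliary head could be trained to predict,
    which is one plausible form of the sequence self-supervision task.
    """
    # x: (batch, num_frames, embed_dim)
    labels = (torch.rand(x.size(0)) < p).long()    # 1 = play clip backwards
    reversed_x = torch.flip(x, dims=[1])           # flip along the time axis
    mask = labels.view(-1, 1, 1).bool()
    return torch.where(mask, reversed_x, x), labels


if __name__ == "__main__":
    clips = torch.randn(8, 16, 256)                # 8 clips, 16 frames, 256-d features
    mixed, order_labels = reverse_random_samples(clips)
    features = TemporalSelfAttention(embed_dim=256)(mixed)
    print(features.shape, order_labels)
```

In a full model, the per-frame features would come from a 2D CNN backbone, and the reversal labels would supervise an auxiliary classification head alongside the main action-recognition loss.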
