Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition

The self-supervised pretraining paradigm has achieved great success in skeleton-based action recognition. However, existing methods treat the motion and static parts of the skeleton equally and lack an adaptive design for the different parts, which hurts action recognition accuracy. To model both parts adaptively, we propose an Actionlet-Dependent Contrastive Learning method (ActCLR). The actionlet, defined as the discriminative subset of the human skeleton, effectively decomposes motion regions for better action modeling. Specifically, by contrasting the skeleton data with a static anchor that contains no motion, we extract the motion region, which serves as the actionlet, in an unsupervised manner. Then, centered on the actionlet, we build a motion-adaptive data transformation method. Different data transformations are applied to actionlet and non-actionlet regions to introduce more diversity while maintaining their own characteristics. Meanwhile, we propose a semantic-aware feature pooling method that builds feature representations for motion and static regions separately. Extensive experiments on NTU RGB+D and PKU-MMD show that the proposed method achieves remarkable action recognition performance. Additional visualizations and quantitative experiments demonstrate the effectiveness of our method. Our project website is available at https://langlandslin.github.io/projects/ActCLR/
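To make the idea of contrasting against a static anchor concrete, the following is a minimal sketch and not the ActCLR implementation. It assumes a simplification: joints are scored directly in coordinate space by their average distance from a static pose, and a fixed threshold selects the actionlet. The function names `actionlet_mask` and `masked_pool`, the `ratio` parameter, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def actionlet_mask(seq, anchor, ratio=0.5):
    """Hypothetical helper. seq: (T, V, C) joint coordinates; anchor: (V, C) static pose.

    Returns a boolean mask of shape (V,) marking joints treated as the actionlet.
    """
    # Per-joint deviation from the static anchor, averaged over time.
    score = np.linalg.norm(seq - anchor[None], axis=-1).mean(axis=0)  # (V,)
    # Assumed rule: keep joints whose score exceeds a fraction of the maximum.
    return score > ratio * score.max()

def masked_pool(feat, mask):
    """Hypothetical helper. feat: (T, V, D); average over time and the joints in mask."""
    w = mask.astype(feat.dtype)
    # Guard against an empty mask so the division stays defined.
    denom = feat.shape[0] * max(w.sum(), 1.0)
    return (feat * w[None, :, None]).sum(axis=(0, 1)) / denom

def demo():
    T, V, C = 8, 5, 3
    anchor = np.zeros((V, C))                # static pose without motion
    seq = np.zeros((T, V, C))
    seq[:, 2, 0] = np.linspace(0.0, 1.0, T)  # only joint 2 moves
    mask = actionlet_mask(seq, anchor)
    motion_feat = masked_pool(seq, mask)     # pooled over actionlet joints
    static_feat = masked_pool(seq, ~mask)    # pooled over non-actionlet joints
    return mask, motion_feat, static_feat
```

Calling `demo()` pools the moving and static joints separately, which mirrors, in a very reduced form, the idea of building distinct representations for the two regions. Here raw coordinates stand in for learned features.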
