UNIK: A Unified Framework for Real-world Skeleton-based Action Recognition

Action recognition based on skeleton data has recently witnessed increasing attention and progress. State-of-theart approaches adopting Graph Convolutional networks (GCNs) can effectively extract features on human skeletons relying on the pre-defined human topology. Despite associated progress, GCN-based methods have difficulties to generalize across domains, especially with different human topological structures. In this context, we introduce UNIK*, a novel skeleton-based action recognition method that is not only effective to learn spatio-temporal features on human skeleton sequences but also able to generalize across datasets. This is achieved by learning an optimal dependency matrix from the uniform distribution based on a multi-head attention mechanism. Subsequently, to study the cross-domain generalizability of skeleton-based action recognition in real-world videos, we re-evaluate state-ofthe-art approaches as well as the proposed UNIK in light of a novel Posetics dataset. This dataset is created from Kinetics-400 videos by estimating, refining and filtering poses. We provide an analysis on how much performance improves on smaller benchmark datasets after pre-training on Posetics for the action classification task. Experimen*Equal contribution. Code is available at: https://github.com/YangDi666/UNIK tal results show that the proposed UNIK, with pre-training on Posetics, generalizes well and outperforms state-of-theart when transferred onto four target action classification datasets: Toyota Smarthome, Penn Action, NTU-RGB+D 60 and NTU-RGB+D 120.

[1]  Yifan Zhang,et al.  Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks , 2019, IEEE Transactions on Image Processing.

[2]  Nanning Zheng,et al.  View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Chao Li,et al.  Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation , 2018, IJCAI.

[4]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Tianjiao Li,et al.  UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Antitza Dantcheva,et al.  ImaGINator: Conditional Spatio-Temporal GAN for Video Generation , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[8]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Weiyu Zhang,et al.  From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[10]  Austin Reiter,et al.  Interpretable 3D Human Action Analysis with Temporal Convolutional Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[11]  Jiaying Liu,et al.  Optimized Skeleton-based Action Recognition via Sparsified Graph Regression , 2018, ACM Multimedia.

[12]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[14]  Antitza Dantcheva,et al.  From Attribute-Labels to Faces: Face Generation Using a Conditional Generative Adversarial Network , 2018, ECCV Workshops.

[15]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  David Picard,et al.  Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition , 2020, IEEE transactions on pattern analysis and machine intelligence.

[17]  Andrew Zisserman,et al.  A Short Note on the Kinetics-700 Human Action Dataset , 2019, ArXiv.

[18]  Alexei A. Efros,et al.  Everybody Dance Now , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  William Robson Schwartz,et al.  Skeleton Image Representation for 3D Action Recognition Based on Tree Structure and Reference Joints , 2019, 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI).

[20]  Zhenghao Chen,et al.  Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Xu Chen,et al.  Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Cordelia Schmid,et al.  LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Hassen Drira,et al.  Sparse Coding of Shape Trajectories for Facial Expression and Action Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Gang Wang,et al.  NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Xiaopeng Hong,et al.  Learning Graph Convolutional Network for Skeleton-based Human Action Recognition by Neural Searching , 2019, AAAI.

[29]  Bingbing Ni,et al.  3D Human Action Representation Learning via Cross-View Consistency Pursuit , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[31]  Gianpiero Francesca,et al.  Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos , 2020, ArXiv.

[32]  Michal Koperski,et al.  Toyota Smarthome: Real-World Activities of Daily Living , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[33]  Ravi Kiran Sarvadevabhatla,et al.  Quo Vadis, Skeleton Action Recognition? , 2020, International Journal of Computer Vision.

[34]  Florian Schroff,et al.  View-Invariant Probabilistic Embedding for Human Pose , 2020, ECCV.

[35]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Antitza Dantcheva,et al.  InMoDeGAN: Interpretable Motion Decomposition Generative Adversarial Network for Video Generation , 2021, ArXiv.

[37]  Michael S. Ryoo,et al.  Recognizing Actions in Videos from Unseen Viewpoints , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Rui Dai,et al.  VPN: Learning Video-Pose Embedding for Activities of Daily Living , 2020, ECCV.

[39]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Michael S. Ryoo,et al.  AssembleNet++: Assembling Modality Representations via Attention Connections , 2020, ECCV.

[41]  Zhaoxiang Zhang,et al.  Relational Network for Skeleton-Based Action Recognition , 2018, 2019 IEEE International Conference on Multimedia and Expo (ICME).

[42]  Chen Chen,et al.  Memory Attention Networks for Skeleton-Based Action Recognition , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[43]  Behrooz Mahasseni,et al.  Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Ying Wu,et al.  Cross-View Action Modeling, Learning, and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.