Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement

Learning a good 3D human pose representation is important for human pose related tasks, e.g. human 3D pose estimation and action recognition. Within all these problems, preserving the intrinsic pose information and adapting to view variations are two critical issues. In this work, we propose a novel Siamese denoising autoencoder to learn a 3D pose representation by disentangling the pose-dependent and view-dependent feature from the human skeleton data, in a fully unsupervised manner. These two disentangled features are utilized together as the representation of the 3D pose. To consider both the kinematic and geometric dependencies, a sequential bidirectional recursive network (SeBiReNet) is further proposed to model the human skeleton data. Extensive experiments demonstrate that the learned representation 1) preserves the intrinsic information of human pose, 2) shows good transferability across datasets and tasks. Notably, our approach achieves stateof-the-art performance on two inherently different tasks: pose denoising and unsupervised action recognition. Code and models are available at: https://github.com/NIEQiang001/unsupervised-human-pose.git.

[1]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[2]  Li Fei-Fei,et al.  Unsupervised Learning of Long-Term Motion Dynamics for Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Xin Wang,et al.  View-Invariant Human Action Recognition Based on a 3D Bio-Constrained Skeleton Model , 2019, IEEE Transactions on Image Processing.

[4]  Gang Wang,et al.  Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks , 2017, IEEE Transactions on Image Processing.

[5]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[6]  Claire Cardie,et al.  Deep Recursive Neural Networks for Compositionality in Language , 2014, NIPS.

[7]  Jianhua Dai,et al.  Unsupervised Representation Learning With Long-Term Dynamics for Skeleton Based Action Recognition , 2018, AAAI.

[8]  Xiaogang Wang,et al.  Fashion Landmark Detection in the Wild , 2016, ECCV.

[9]  Qiang Ji,et al.  Bayesian Graph Convolution LSTM for Skeleton Based Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Dani Lischinski,et al.  Learning character-agnostic motion for motion retargeting in 2D , 2019, ACM Trans. Graph..

[11]  Kai Liu,et al.  Tensor-based linear dynamical systems for action recognition from 3D skeletons , 2018, Pattern Recognit..

[12]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[13]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[14]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Yichen Wei,et al.  Compositional Human Pose Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Yichen Wei,et al.  Weakly-supervised Transfer for 3D Human Pose Estimation in the Wild , 2017, ArXiv.

[17]  Cheng Li,et al.  Delving Deep Into Hybrid Annotations for 3D Human Recovery in the Wild , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[19]  Yonghong Song,et al.  Human skeleton tree recurrent neural network with joint relative motion feature for skeleton based action recognition , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[20]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[21]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Tieniu Tan,et al.  An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  R. Venkatesh Babu,et al.  Unsupervised Feature Learning of Human Actions As Trajectories in Pose Embedding Manifold , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[25]  Luc Van Gool,et al.  Deep Learning on Lie Groups for Skeleton-Based Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[29]  Mohan S. Kankanhalli,et al.  Unsupervised Learning of View-invariant Action Representations , 2018, NeurIPS.

[30]  Liang Wang,et al.  Learning content and style: Joint action recognition and person identification from human skeletons , 2018, Pattern Recognit..

[31]  Ying Wu,et al.  Cross-View Action Modeling, Learning, and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[33]  Björn E. Ottersten,et al.  Pose Encoding for Robust Skeleton-Based Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[34]  Christopher D. Manning,et al.  Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks , 2010 .

[35]  Xiaodong Yang,et al.  EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[36]  Yu Tian,et al.  Semantic Graph Convolutional Networks for 3D Human Pose Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).