Normalized Human Pose Features for Human Action Video Alignment

We present a novel approach for extracting human pose features from human action videos. The goal is to let the pose features capture only the poses of the action while being invariant to other factors, including video back-grounds, the video subjects’ anthropometric characteristics and viewpoints. Such human pose features facilitate the comparison of pose similarity and can be used for down-stream tasks, such as human action video alignment and pose retrieval. The key to our approach is to first normalize the poses in the video frames by mapping the poses onto a pre-defined 3D skeleton to not only disentangle subject physical features, such as bone lengths and ratios, but also to unify global orientations of the poses. Then the normalized poses are mapped to a pose embedding space of high-level features, learned via unsupervised metric learning. We evaluate the effectiveness of our normalized features both qualitatively by visualizations, and quantitatively by a video alignment task on the Human3.6M dataset and an action recognition task on the Penn Action dataset.

[1]  Qifeng Chen,et al.  Self-supervised Dance Video Synthesis Conditioned on Music , 2020, ACM Multimedia.

[2]  Christian Theobalt,et al.  PhysCap , 2020, ACM Trans. Graph..

[3]  Hujun Bao,et al.  Motion Capture from Internet Videos , 2020, ECCV.

[4]  Andreas Aristidou,et al.  MotioNet , 2020, ACM Trans. Graph..

[5]  Siyu Zhu,et al.  End-to-End Learning Local Multi-View Descriptors for 3D Point Clouds , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Yichen Wei,et al.  Circle Loss: A Unified Perspective of Pair Similarity Optimization , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Ting Liu,et al.  View-Invariant Probabilistic Embedding for Human Pose , 2019, ECCV.

[9]  Ross B. Girshick,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Juan Carlos Niebles,et al.  Few-Shot Video Classification via Temporal Alignment , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Dani Lischinski,et al.  Learning character-agnostic motion for motion retargeting in 2D , 2019, ACM Trans. Graph..

[12]  Jonathan Tompson,et al.  Temporal Cycle-Consistency Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Yu Tian,et al.  Semantic Graph Convolutional Networks for 3D Human Pose Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Christian Theobalt,et al.  In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Chen Qian,et al.  Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Yi Zhou,et al.  On the Continuity of Rotation Representations in Neural Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  R. Venkatesh Babu,et al.  Unsupervised Feature Learning of Human Actions As Trajectories in Pose Embedding Manifold , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[18]  Daniel Cohen-Or,et al.  Deep motifs and motion signatures , 2018, ACM Trans. Graph..

[19]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Frédéric H. Pighin,et al.  University of Wisconsin-Madison , 2018, State Governments and Research Universities.

[21]  Pietro Perona,et al.  It's all Relative: Monocular 3D Human Pose Estimation from Weakly Supervised Data , 2018, BMVC.

[22]  Xiaowei Zhou,et al.  Ordinal Depth Supervision for 3D Human Pose Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Dario Pavllo,et al.  QuaterNet: A Quaternion-based Recurrent Model for Human Motion , 2018, BMVC.

[24]  Ruben Villegas,et al.  Neural Kinematic Networks for Unsupervised Motion Retargetting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Pascal Fua,et al.  Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation , 2018, ECCV.

[26]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[28]  Björn Ommer,et al.  Self-Supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Stephan Günnemann,et al.  Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking , 2017, ICLR.

[30]  Jiri Matas,et al.  Working hard to know your neighbor's margins: Local descriptor learning loss , 2017, NIPS.

[31]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[33]  Sergey Levine,et al.  Time-Contrastive Networks: Self-Supervised Learning from Video , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[34]  Yichen Wei,et al.  Compositional Human Pose Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[36]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Basura Fernando,et al.  Unsupervised Human Action Detection by Action Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[38]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[40]  Yaser Sheikh,et al.  Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[42]  Tinne Tuytelaars,et al.  Rank Pooling for Action Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Greg Mori,et al.  Pose Embeddings: A Deep Architecture for Learning to Match Human Poses , 2015, ArXiv.

[44]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Yang Song,et al.  Learning Fine-Grained Image Similarity with Deep Ranking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Weiyu Zhang,et al.  From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[49]  Eshed Ohn-Bar,et al.  Joint Angles Similarities and HOG2 for Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[50]  Jessica K. Hodgins,et al.  Hierarchical Aligned Cluster Analysis for Temporal Clustering of Human Motion , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[52]  Takeo Igarashi,et al.  Retrieval and Visualization of Human Motion Data via Stick Figures , 2012, Comput. Graph. Forum.

[53]  Andrew Zisserman,et al.  2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images , 2012, International Journal of Computer Vision.

[54]  Fei Wu,et al.  Learning a 3D Human Pose Distance Metric from Geometric Pose Descriptor , 2011, IEEE Transactions on Visualization and Computer Graphics.

[55]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[56]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[57]  Aaron Hertzmann,et al.  Style-based inverse kinematics , 2004, ACM Trans. Graph..

[58]  Donald J. Berndt,et al.  Using Dynamic Time Warping to Find Patterns in Time Series , 1994, KDD Workshop.

[59]  Pascal Fua,et al.  XNect , 2019, ACM Trans. Graph..