Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats

Deep learning-based 3D human pose estimation performs best when trained on large amounts of labeled data, making combined learning from many datasets an important research direction. One obstacle to this endeavor is that different datasets provide different skeleton formats, i.e., they do not label the same set of anatomical landmarks. There is little prior research on how best to supervise one model with such discrepant labels. We show that simply using separate output heads for different skeletons results in inconsistent depth estimates and insufficient information sharing across skeletons. As a remedy, we propose a novel affine-combining autoencoder (ACAE) method to perform dimensionality reduction on the number of landmarks. The discovered latent 3D points capture the redundancy among skeletons, enabling enhanced information sharing when used for consistency regularization. Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model, which outperforms prior work on a range of benchmarks, including the challenging 3D Poses in the Wild (3DPW) dataset. Our code and models are available for research purposes.
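To make the term "affine-combining" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an encoder and a decoder each compute every output point as a weighted sum of input points, with the weights of each output summing to 1. The toy skeleton, the weight values, and the helper names (`normalize_rows`, `affine_combine`, `rigid_transform`) are hypothetical and chosen only for illustration. The sketch checks one geometric consequence of that constraint: encoding commutes with rotation and translation of the pose.

```python
import math

def normalize_rows(raw):
    """Scale each row so its weights sum to 1 (assumed way to get affine weights)."""
    return [[w / sum(row) for w in row] for row in raw]

def affine_combine(weights, points):
    """Map N input 3D points to M output 3D points, one weighted sum per weight row."""
    return [
        tuple(sum(w * p[axis] for w, p in zip(row, points)) for axis in range(3))
        for row in weights
    ]

def rigid_transform(points, angle, shift):
    """Rotate about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]

# Hypothetical toy skeleton with 4 joints, reduced to 2 latent points.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 1.5, 0.2), (-0.5, 1.5, -0.2)]
encoder = normalize_rows([[1.0, 1.0, 0.0, 0.0], [0.0, 0.5, 1.0, 1.0]])          # 2 x 4
decoder = normalize_rows([[1.5, -0.5], [0.5, 0.5], [-0.2, 1.2], [-0.2, 1.2]])   # 4 x 2

latent = affine_combine(encoder, joints)          # 2 latent 3D points
reconstructed = affine_combine(decoder, latent)   # back to 4 joints (lossy here)

# Because every weight row sums to 1, transforming then encoding gives the
# same result as encoding then transforming, up to floating-point error.
angle, shift = 0.7, (2.0, -1.0, 3.0)
moved_then_encoded = affine_combine(encoder, rigid_transform(joints, angle, shift))
encoded_then_moved = rigid_transform(latent, angle, shift)
max_gap = max(
    abs(a - b)
    for p, q in zip(moved_then_encoded, encoded_then_moved)
    for a, b in zip(p, q)
)
print("largest coordinate gap:", max_gap)
```

In this sketch the weights are fixed by hand. In a learned setting they would be trained to minimize reconstruction error, and the resulting latent points could then act as a shared representation across skeleton formats, as the abstract describes.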
