HuMoR: 3D Human Motion Model for Robust Pose Estimation

We introduce HuMoR: a 3D Human Motion Model for Robust Estimation of temporal pose and shape. Though substantial progress has been made in estimating 3D human motion and shape from dynamic observations, recovering plausible pose sequences in the presence of noise and occlusions remains a challenge. For this purpose, we propose an expressive generative model in the form of a conditional variational autoencoder, which learns a distribution of the change in pose at each step of a motion sequence. Furthermore, we introduce a flexible optimization-based approach that leverages HuMoR as a motion prior to robustly estimate plausible pose and shape from ambiguous observations. Through extensive evaluations, we demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset, and enables motion reconstruction from multiple input modalities including 3D keypoints and RGB(-D) videos. See the project page at geometry.stanford.edu/projects/humor.

[1]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[2]  Michael J. Black,et al.  PARE: Part Attention Regressor for 3D Human Body Estimation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Kris Kitani,et al.  SimPoE: Simulated Character Control for 3D Human Pose Estimation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Sung-Hee Lee,et al.  LoBSTr: Real‐time Lower‐body Pose Prediction from Sparse Upper‐body Tracking Signals , 2021, Comput. Graph. Forum.

[5]  Joachim Tesch,et al.  Populating 3D Scenes by Learning Human-Scene Interaction , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Michael J. Black,et al.  We are More than Our Joints: Predicting how 3D Bodies Move , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  James M. Rehg,et al.  4D Human Body Capture from Egocentric Video via 3D Scene Grounding , 2020, 2021 International Conference on 3D Vision (3DV).

[8]  Andreas Aristidou,et al.  MotioNet , 2020, ACM Trans. Graph..

[9]  Ivan Kobyzev,et al.  Normalizing Flows: An Introduction and Review of Current Methods , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  C. Schmid,et al.  Synthetic Humans for Action Recognition from Unseen Viewpoints , 2019, International Journal of Computer Vision.

[11]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Jianbo Shi,et al.  Towards Statistically Provable Geometric 3D Human Pose Recovery , 2021, SIAM J. Imaging Sci..

[13]  A. Vedaldi,et al.  3D Multi-bodies: Fitting Sets of Plausible 3D Human Models to Ambiguous Image Data , 2020, NeurIPS.

[14]  M. A. Brubaker,et al.  Probabilistic Character Motion Synthesis using a Hierarchical Deep Latent Variable Model , 2020, Comput. Graph. Forum.

[15]  Christian Theobalt,et al.  PhysCap , 2020, ACM Trans. Graph..

[16]  Dimitrios Tzionas,et al.  Monocular Expressive Body Regression through Body-Driven Attention , 2020, ECCV.

[17]  David F. Fouhey,et al.  Full-Body Awareness from Partial Observations , 2020, ECCV.

[18]  Zhengyi Luo,et al.  3D Human Motion Estimation via Motion Compression and Refinement , 2020, ACCV.

[19]  Leonidas J. Guibas,et al.  Contact and Human Dynamics from Monocular Video , 2020, SCA.

[20]  Michiel van de Panne,et al.  Character controllers using motion VAEs , 2020, ACM Trans. Graph..

[21]  Yangang Wang,et al.  Object-Occluded Human Shape and Pose Estimation From a Single Color Image , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Lars Petersson,et al.  A Stochastic Conditioning Scheme for Diverse Human Motion Prediction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Cristian Sminchisescu,et al.  Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows , 2020, ECCV.

[24]  Kris M. Kitani,et al.  DLow: Diversifying Latent Flows for Diverse Human Motion Prediction , 2020, ECCV.

[25]  Christian Theobalt,et al.  DeepCap: Monocular Human Performance Capture Using Weak Supervision , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Ziyan Wu,et al.  Hierarchical Kinematic Human Mesh Recovery , 2020, ECCV.

[27]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Jonas Beskow,et al.  MoGlow , 2019, ACM Trans. Graph..

[29]  Francesc Moreno-Noguer,et al.  Context-Aware Human Motion Prediction , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Pascal Fua,et al.  XNect , 2019, ACM Trans. Graph..

[31]  Sebastian Starke,et al.  Neural state machine for character-scene interactions , 2019, ACM Trans. Graph..

[32]  Mohammad Norouzi,et al.  Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse , 2019, NeurIPS.

[33]  Otmar Hilliges,et al.  Structured Prediction Helps 3D Human Motion Modelling , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Song-Chun Zhu,et al.  Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Dimitrios Tzionas,et al.  Resolving 3D Human Pose Ambiguities With 3D Scene Constraints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37]  Liu Wu,et al.  Human Mesh Recovery From Monocular Images via a Skeleton-Disentangled Representation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Jitendra Malik,et al.  Predicting 3D Human Dynamics From Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  Andrew Zisserman,et al.  Sim2real transfer learning for 3D human pose estimation: motion to the rescue , 2019, NeurIPS.

[40]  Iasonas Kokkinos,et al.  HoloPose: Holistic 3D Human Reconstruction In-The-Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Andrew Zisserman,et al.  Exploiting Temporal Context for 3D Human Pose Estimation in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Nikolaus F. Troje,et al.  AMASS: Archive of Motion Capture As Surface Shapes , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[45]  Nicolas Mansard,et al.  Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Dario Pavllo,et al.  Modeling Human Motion with Quaternion-Based Neural Networks , 2019, International Journal of Computer Vision.

[47]  Jan Kautz,et al.  PlaneRCNN: 3D Plane Detection and Reconstruction From a Single Image , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Kaiming He,et al.  Group Normalization , 2018, International Journal of Computer Vision.

[52]  A. Aristidou Digital Dance Ethnography: Organizing Large Dance Collections , 2019 .

[53]  Satoru Fukayama,et al.  AIST Dance Video Database: Multi-Genre, Multi-Dancer, and Multi-Camera Database for Dance Information Processing , 2019, ISMIR.

[54]  Michael J. Black,et al.  Resolving 3 D Human Pose Ambiguities with 3 D Scene Constraints , 2019 .

[55]  Jitendra Malik,et al.  SFV , 2018, ACM Trans. Graph..

[56]  David Duvenaud,et al.  Neural Ordinary Differential Equations , 2018, NeurIPS.

[57]  Cristian Sminchisescu,et al.  Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Cordelia Schmid,et al.  BodyNet: Volumetric Inference of 3D Human Body Shapes , 2018, ECCV.

[60]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[63]  Charles Malleson,et al.  Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors , 2017, BMVC.

[64]  Yinghao Huang,et al.  Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).

[65]  Taku Komura,et al.  Phase-functioned neural networks for character control , 2017, ACM Trans. Graph..

[66]  Taku Komura,et al.  A Recurrent Variational Autoencoder for Human Motion Synthesis , 2017, BMVC.

[67]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[68]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[69]  Peter V. Gehler,et al.  Unite the People: Closing the Loop Between 3D and 2D Human Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[72]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[73]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[74]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[75]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[76]  Tamim Asfour,et al.  The KIT whole-body human motion database , 2015, 2015 International Conference on Advanced Robotics (ICAR).

[77]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[78]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[80]  Michael J. Black,et al.  MoSh: motion and shape capture from sparse markers , 2014, ACM Trans. Graph..

[81]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[82]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[83]  Ludovic Hoyet,et al.  Sleight of hand: perception of finger motion from reduced marker sets , 2012, I3D '12.

[84]  Hans-Peter Seidel,et al.  A data-driven approach for real-time full body pose reconstruction from a depth camera , 2011, 2011 International Conference on Computer Vision.

[85]  Sebastian Thrun,et al.  Real time motion capture using a single time-of-flight camera , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[86]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[87]  David J. Fleet,et al.  Physics-Based Person Tracking Using the Anthropomorphic Walker , 2010, International Journal of Computer Vision.

[88]  David J. Fleet,et al.  Estimating contact dynamics , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[89]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[90]  Lucas Kovar,et al.  Motion Graphs , 2002, ACM Trans. Graph..

[91]  Tido Röder,et al.  Documentation Mocap Database HDM05 , 2007 .

[92]  Geoffrey E. Hinton,et al.  Modeling Human Motion Using Binary Latent Variables , 2006, NIPS.

[93]  David J. Fleet,et al.  Temporal motion models for monocular and multiview 3D human body tracking , 2006, Comput. Vis. Image Underst..

[94]  David J. Fleet,et al.  3D People Tracking with Gaussian Process Dynamical Models , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[95]  C. Karen Liu,et al.  Learning physics-based motion style with nonlinear inverse optimization , 2005, ACM Trans. Graph..

[96]  A. Elgammal,et al.  Separating style and content on a nonlinear manifold , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[97]  N. Troje Decomposing biological motion: a framework for analysis and synthesis of human gait patterns. , 2002, Journal of vision.

[98]  Harry Shum,et al.  Motion texture: a two-level statistical model for character motion synthesis , 2002, ACM Trans. Graph..

[99]  Aaron Hertzmann,et al.  Style machines , 2000, SIGGRAPH 2000.

[100]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[101]  Michael J. Black,et al.  Learning and Tracking Cyclic Human Motion , 2000, NIPS.

[102]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[103]  Vladimir Pavlovic,et al.  Learning Switching Linear Models of Human Motion , 2000, NIPS.

[104]  William T. Freeman,et al.  Bayesian Reconstruction of 3D Human Motion from Single-Camera Video , 1999, NIPS.

[105]  Michael F. Cohen,et al.  Verbs and Adverbs: Multidimensional Motion Interpolation , 1998, IEEE Computer Graphics and Applications.

[106]  Stuart Geman,et al.  Statistical methods for tomographic image reconstruction , 1987 .

[107]  J. Tukey,et al.  The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data , 1974 .