Physics-based Human Motion Estimation and Synthesis from Videos

Human motion synthesis is an important problem with applications in graphics, gaming, and simulation environments for robotics. Existing methods require accurate motion capture data for training, which is costly to obtain. Instead, we propose a framework for training generative models of physically plausible human motion directly from monocular RGB videos, which are much more widely available. At the core of our method is a novel optimization formulation that corrects imperfect image-based pose estimates by enforcing physics constraints and reasoning about contacts in a differentiable way. This optimization yields corrected 3D poses and motions, as well as their corresponding contact forces. Results show that our physically corrected motions significantly outperform prior work on pose estimation. We then use the corrected motions to train a generative model that synthesizes future motion. On the large-scale Human3.6M dataset [12], we demonstrate, both qualitatively and quantitatively, that our method significantly improves motion estimation, synthesis quality, and physical plausibility compared to prior kinematic and physics-based methods. By enabling learning of motion synthesis from video, our method paves the way for large-scale, realistic, and diverse motion synthesis.

[1]  Michael J. Black,et al.  We are More than Our Joints: Predicting how 3D Bodies Move , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Cewu Lu,et al.  HybrIK: A Hybrid Analytical-Neural Inverse Kinematics Solution for 3D Human Pose and Shape Estimation , 2020, ArXiv.

[3]  Raquel Urtasun,et al.  Recovering and Simulating Pedestrians in the Wild , 2020, CoRL.

[4]  Ersin Yumer,et al.  MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics , 2018, ECCV.

[5]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Leonidas J. Guibas,et al.  Contact and Human Dynamics from Monocular Video , 2020, SCA.

[7]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Zoran Popovic,et al.  Discovery of complex behaviors through contact-invariant optimization , 2012, ACM Trans. Graph..

[9]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[10]  Ludovic Righetti,et al.  Variable Horizon MPC With Swing Foot Dynamics for Bipedal Walking Control , 2021, IEEE Robotics and Automation Letters.

[11]  Nicolas Mansard,et al.  Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Zoran Popovic,et al.  Motion fields for interactive character locomotion , 2010, CACM.

[13]  Taku Komura,et al.  A Recurrent Variational Autoencoder for Human Motion Synthesis , 2017, BMVC.

[14]  Sergey Levine,et al.  DeepMimic , 2018, ACM Trans. Graph..

[15]  Carsten Stoll,et al.  TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video , 2020, ECCV.

[16]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[17]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[18]  Jonas Beskow,et al.  MoGlow , 2019, ACM Trans. Graph..

[19]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023.

[20]  Zoran Popovic,et al.  Interactive Control of Diverse Complex Characters with Neural Networks , 2015, NIPS.

[21]  James M. Rehg,et al.  4D Human Body Capture from Egocentric Video via 3D Scene Grounding , 2020, 2021 International Conference on 3D Vision (3DV).

[22]  M. V. D. Panne,et al.  Sampling-based contact-rich motion control , 2010, ACM Trans. Graph..

[23]  C. K. Liu,et al.  A Quick Tutorial on Multibody Dynamics , 2012.

[24]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jungdam Won,et al.  A scalable approach to control diverse behaviors for physically simulated characters , 2020, ACM Trans. Graph..

[26]  Vladlen Koltun,et al.  Animating human lower limbs using contact-invariant optimization , 2013, ACM Trans. Graph..

[27]  Ye Yuan,et al.  Residual Force Control for Agile Human Behavior Imitation and Extended Motion Synthesis , 2020, NeurIPS.

[28]  Bo Ren,et al.  GPU-based contact-aware trajectory optimization using a smooth force model , 2019, Symposium on Computer Animation.

[29]  Marco Hutter,et al.  Gait and Trajectory Optimization for Legged Systems Through Phase-Based End-Effector Parameterization , 2018, IEEE Robotics and Automation Letters.

[30]  Baining Guo,et al.  Improving Sampling‐based Motion Control , 2015, Comput. Graph. Forum.

[31]  Lucas Kovar,et al.  Motion Graphs , 2002, ACM Trans. Graph..

[32]  Sanja Fidler,et al.  Learning to Generate Diverse Dance Motions with Transformer , 2020, ArXiv.

[33]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Otmar Hilliges,et al.  Structured Prediction Helps 3D Human Motion Modelling , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018.

[36]  Yi Zhou,et al.  Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis , 2017, ICLR.

[37]  Michiel van de Panne,et al.  Character controllers using motion VAEs , 2020, ACM Trans. Graph..

[38]  Pavlo Molchanov,et al.  KAMA: 3D Keypoint Aware Body Mesh Articulation , 2021, 2021 International Conference on 3D Vision (3DV).

[39]  Jitendra Malik,et al.  Predicting 3D Human Dynamics From Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Otmar Hilliges,et al.  Learning Human Motion Models for Long-Term Predictions , 2017, 2017 International Conference on 3D Vision (3DV).

[41]  Ye Yuan,et al.  DLow: Diversifying Latent Flows for Diverse Human Motion Prediction , 2020, ECCV.

[42]  David J. Fleet,et al.  Estimating contact dynamics , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[43]  Christian Theobalt,et al.  DeepCap: Monocular Human Performance Capture Using Weak Supervision , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Nikolaus F. Troje,et al.  AMASS: Archive of Motion Capture As Surface Shapes , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[45]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Nancy S. Pollard,et al.  Evaluating motion graphs for character animation , 2007, TOGS.

[47]  Odest Chadwicke Jenkins,et al.  Physical simulation for probabilistic motion tracking , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Sebastian Starke,et al.  Local motion phases for learning multi-contact character movements , 2020, ACM Trans. Graph..

[49]  D K Smith,et al.  Numerical Optimization , 2001, J. Oper. Res. Soc..

[50]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[51]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Zoran Popovic,et al.  Contact-invariant optimization for hand manipulation , 2012, SCA '12.

[53]  Sebastian Starke,et al.  Neural state machine for character-scene interactions , 2019, ACM Trans. Graph..

[54]  Pavlo Molchanov,et al.  Weakly-Supervised 3D Human Pose Learning via Multi-View Images in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Sanja Fidler,et al.  UniCon: Universal Neural Controller For Physics-based Character Motion , 2020, ArXiv.