FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration

3D human pose estimation can facilitate various applications, such as assistive technologies and AR/VR. Most existing monocular 3D pose estimation approaches only focus on a single body part, neglecting the fact that the essential nuance of human motion is conveyed through a concert of subtle movements of face, hands, and body. In this paper, we present FrankMocap, a fast and accurate whole-body 3D pose estimation system that can produce 3D face, hands, and body simultaneously from in-the-wild monocular images. The idea of FrankMocap is its modular design: We first run 3D pose regression methods for face, hands, and body independently, followed by composing the regression outputs via an integration module. The separate regression modules allow us to take full advantage of their state-of-the-art performances without compromising the original accuracy and reliability in practice. We develop three different integration modules that trade off between latency and accuracy. All of them are capable of providing simple yet effective solutions to unify the separate outputs into seamless whole-body pose estimation results. We quantitatively and qualitatively demonstrate that our modularized system outperforms both the optimization-based and end-to-end methods of estimating whole-body pose.

[1]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Cewu Lu,et al.  HybrIK: A Hybrid Analytical-Neural Inverse Kinematics Solution for 3D Human Pose and Shape Estimation , 2020, ArXiv.

[3]  Bo Dai,et al.  Scene-aware Generative Network for Human Motion Synthesis , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Jie Song,et al.  Human Body Model Fitting by Learned Gradient Descent , 2020, ECCV.

[5]  Andrea Vedaldi,et al.  Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation , 2020, 2021 International Conference on 3D Vision (3DV).

[6]  Wanli Ouyang,et al.  3D Human Mesh Regression With Dense Correspondence , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Lijuan Wang,et al.  End-to-End Human Pose and Mesh Reconstruction with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Yaser Sheikh,et al.  Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Daniil Osokin,et al.  Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose , 2018, ICPRAM.

[10]  Giacomo Boracchi,et al.  Uniform Motion Blur in Poissonian Noise: Blur/Noise Tradeoff , 2011, IEEE Transactions on Image Processing.

[11]  Dragomir Anguelov,et al.  SCAPE: shape completion and animation of people , 2005, ACM Trans. Graph..

[12]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[13]  Song-Chun Zhu,et al.  DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Pengfei Wan,et al.  Camera-Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D-1D Registration , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[16]  J. Gower Generalized procrustes analysis , 1975 .

[17]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[18]  Margrit Betke,et al.  Home-Based Physical Therapy with an Interactive Computer Vision System , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[19]  Kyoung Mu Lee,et al.  I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image , 2020, ECCV.

[20]  Yiying Tong,et al.  FaceWarehouse: A 3D Facial Expression Database for Visual Computing , 2014, IEEE Transactions on Visualization and Computer Graphics.

[21]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Tae-Kyun Kim,et al.  Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Trevor Darrell,et al.  Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body Dynamics , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Qiang Li,et al.  End-to-End Hand Mesh Recovery From a Monocular RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Dahua Lin,et al.  Motion Guided 3D Pose Estimation from Videos , 2020, ECCV.

[31]  Xi Zhou,et al.  Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network , 2018, ECCV.

[32]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[33]  Gemma Piella,et al.  Survey on 3D face reconstruction from uncalibrated images , 2020, Comput. Sci. Rev..

[34]  Yaser Sheikh,et al.  Driving-signal aware full-body avatars , 2021, ACM Trans. Graph..

[35]  Cheng Li,et al.  Delving Deep Into Hybrid Annotations for 3D Human Recovery in the Wild , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Patrick Pérez,et al.  State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications , 2018, Comput. Graph. Forum.

[37]  Thomas Brox,et al.  FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[39]  Mohan M. Trivedi,et al.  Deep Learning for Assistive Computer Vision , 2018, ECCV Workshops.

[40]  Chen Change Loy,et al.  Chasing the Tail in Monocular 3D Human Reconstruction With Prototype Memory , 2020, IEEE Transactions on Image Processing.

[41]  Patrick Pérez,et al.  MoFA: Model-Based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  Ramakant Nevatia,et al.  ExpNet: Landmark-Free, Deep, 3D Facial Expressions , 2018, 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).

[43]  Cheng Li,et al.  Pose-Robust Face Recognition via Deep Residual Equivariant Mapping , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Kostas Daniilidis,et al.  Convolutional Mesh Regression for Single-Image Human Shape Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Kyoung Mu Lee,et al.  Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose , 2020, ECCV.

[46]  Vincent Lepetit,et al.  HOnnotate: A Method for 3D Annotation of Hand and Object Poses , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Interaction Capture , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[49]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[50]  Matthew Turk,et al.  A Morphable Model For The Synthesis Of 3D Faces , 1999, SIGGRAPH.

[51]  Tal Hassner,et al.  Regressing Robust and Discriminative 3D Morphable Models with a Very Deep Neural Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Oscar Koller,et al.  Multi-channel Transformers for Multi-articulatory Sign Language Translation , 2020, ECCV Workshops.

[54]  Yaser Sheikh,et al.  Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[55]  Michael J. Black,et al.  Learning a model of facial shape and expression from 4D scans , 2017, ACM Trans. Graph..

[56]  Iasonas Kokkinos,et al.  HoloPose: Holistic 3D Human Reconstruction In-The-Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Jianfei Cai,et al.  Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images , 2018, ECCV.

[58]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[59]  Christian Theobalt,et al.  Monocular Real-Time Hand Shape and Motion Capture Using Multi-Modal Data , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Ziyan Wu,et al.  Hierarchical Kinematic Human Mesh Recovery , 2020, ECCV.

[61]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[62]  Christian Theobalt,et al.  Monocular Real-time Full Body Capture with Inter-part Correlations , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[64]  Michael J. Black,et al.  Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[66]  Dimitrios Tzionas,et al.  Monocular Expressive Body Regression through Body-Driven Attention , 2020, ECCV.

[67]  Giacomo Boracchi,et al.  Modeling the Performance of Image Restoration From Motion Blur , 2012, IEEE Transactions on Image Processing.

[68]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Motion Capture , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[69]  Pavlo Molchanov,et al.  Hand Pose Estimation via Latent 2.5D Heatmap Regression , 2018, ECCV.

[70]  Michael J. Black,et al.  Dyna: a model of dynamic human shape in motion , 2015, ACM Trans. Graph..

[71]  Iasonas Kokkinos,et al.  Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Jianfei Cai,et al.  3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material) , 2019 .

[73]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Mingliang Chen,et al.  3D Hand Pose Tracking and Estimation Using Stereo Matching , 2016, ArXiv.

[75]  Cristian Sminchisescu,et al.  GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[77]  Pascal Fua,et al.  XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera , 2019, ACM Trans. Graph..