MonoPerfCap

We present the first marker-less approach for temporally coherent 3D performance capture of a human with general clothing from monocular video. Our approach reconstructs articulated human skeleton motion as well as medium-scale non-rigid surface deformations in general scenes. Human performance capture is challenging even from multi-view data, due to the large range of articulation, potentially fast motion, and considerable non-rigid deformations. Reconstruction from monocular video alone is drastically more challenging, since strong occlusions and the inherent depth ambiguity make the reconstruction problem highly ill-posed. We tackle these challenges with a novel approach that employs sparse 2D and 3D human pose detections from a convolutional neural network in a batch-based pose estimation strategy. Jointly recovering the motion of each batch allows us to resolve the ambiguities of the monocular reconstruction problem based on a low-dimensional trajectory subspace. In addition, we refine the surface geometry using fully automatically extracted silhouettes to enable medium-scale non-rigid alignment. We demonstrate state-of-the-art performance capture results that enable exciting applications such as video editing and free-viewpoint video, previously infeasible from monocular video. Our qualitative and quantitative evaluation demonstrates that our approach significantly outperforms previous monocular methods in terms of accuracy, robustness, and the scene complexity that can be handled.
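The idea of a low-dimensional trajectory subspace can be illustrated with a small sketch. The abstract does not say which basis is used, so the example below assumes a truncated DCT basis, a batch of 50 frames, and 8 coefficients; the function names `dct_basis` and `fit_batch_trajectory` are hypothetical, and the input is synthetic data rather than real pose detections. It shows only the subspace projection of one joint's trajectory, not the full method.

```python
import numpy as np

def dct_basis(num_frames, num_coeffs):
    """Return a (num_frames, num_coeffs) matrix of orthonormal DCT-II basis vectors."""
    t = np.arange(num_frames)
    k = np.arange(num_coeffs)
    basis = np.cos(np.pi * (2 * t[:, None] + 1) * k[None, :] / (2 * num_frames))
    basis[:, 0] *= np.sqrt(1.0 / num_frames)
    basis[:, 1:] *= np.sqrt(2.0 / num_frames)
    return basis

def fit_batch_trajectory(noisy, num_coeffs):
    """Least-squares fit of a per-batch trajectory inside the DCT subspace.

    noisy: (num_frames, 3) array of per-frame 3D joint estimates.
    Returns the smoothed (num_frames, 3) trajectory.
    """
    basis = dct_basis(noisy.shape[0], num_coeffs)
    coeffs, *_ = np.linalg.lstsq(basis, noisy, rcond=None)
    return basis @ coeffs

# Synthetic example: one joint moving smoothly over a batch of 50 frames.
rng = np.random.default_rng(0)
frames = 50
t = np.linspace(0.0, 1.0, frames)
clean = np.stack([np.sin(2 * np.pi * t), t ** 2, np.cos(np.pi * t)], axis=1)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)  # stand-in for noisy detections

smooth = fit_batch_trajectory(noisy, num_coeffs=8)
err_noisy = float(np.mean((noisy - clean) ** 2))
err_smooth = float(np.mean((smooth - clean) ** 2))
print(f"per-frame error: {err_noisy:.4f}, subspace fit error: {err_smooth:.4f}")
```

Because all frames of a batch share the same few coefficients, independent per-frame noise is averaged out, which is the intuition behind solving for a batch jointly instead of frame by frame.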
