Full-Body Human Motion Capture from Monocular Depth Images

Optical capture of human body motion has many practical applications, ranging from motion analysis in sports and medicine, through ergonomics research, to computer animation in game and movie production. Unfortunately, many existing approaches require expensive multi-camera systems and controlled studios for recording, and expect the person to wear special marker suits. Marker-less approaches, in turn, demand dense camera arrays and indoor recording. These requirements and the high acquisition cost of the equipment make motion capture accessible only to a small number of people. This has changed in recent years, as the availability of inexpensive depth sensors, such as time-of-flight cameras or the Microsoft Kinect, has spawned new research on tracking human motion from monocular depth images. These approaches have the potential to make motion capture accessible to much larger user groups. However, despite significant progress in recent years, unsolved challenges still limit the applicability of depth-based monocular full-body motion capture. Algorithms are challenged by very noisy sensor data, (self-)occlusions, and other ambiguities implied by the limited information that a depth sensor can extract from the scene. In this article, we give an overview of the state of the art in full-body human motion capture using depth cameras. In particular, we elaborate on the challenges current algorithms face and discuss possible solutions. Furthermore, we investigate how the integration of additional sensor modalities may help to resolve some of the ambiguities and improve tracking results.
