3D articulated skeleton extraction using a single consumer-grade depth camera

Abstract: Articulated skeleton extraction or learning has been extensively studied for 2D (e.g., images and video) and 3D (e.g., volume sequences, motion capture, and mesh sequences) data. Nevertheless, robustly and accurately learning 3D articulated skeletons from point set sequences captured by a single consumer-grade depth camera remains challenging, since such data are often corrupted with substantial noise and outliers. Relatively few approaches have been proposed to tackle this problem. In this paper, we present a novel unsupervised framework to address this issue. Specifically, we first build one-to-one point correspondences among the point cloud frames in a sequence with our non-rigid point cloud registration algorithm. We then generate a skeleton with a reasonable number of joints and bones using our skeletal structure extraction algorithm. Finally, we present an iterative Linear Blend Skinning based algorithm for accurate joint learning. As a result, our method can learn a quality articulated skeleton from a single 3D point sequence possibly corrupted with noise and outliers. Through qualitative and quantitative evaluations on both publicly available data and in-house Kinect-captured data, we show that our unsupervised approach soundly outperforms state-of-the-art techniques in terms of both visual quality and accuracy (measured with a Euclidean distance error metric). Moreover, the poses of our extracted skeletons are comparable to those produced by KinectSDK, a well-known supervised pose estimation technique; for example, our method and KinectSDK achieve similar distance errors of 0.0497 and 0.0521, respectively.
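To make two of the terms in the abstract concrete, the following is a minimal sketch of standard Linear Blend Skinning (each posed point is a weighted sum of the point moved by every bone's rigid transform) and of a mean Euclidean distance error between two point or joint sets. It is not the paper's iterative joint learning algorithm. All function names, the two-bone toy setup, and the skinning weights are assumptions made for illustration only.

```python
import math

def rotate_z(angle):
    """3x3 rotation matrix about the z axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(R, t, p):
    """Return R p + t for a 3x3 matrix R and 3-vectors t, p."""
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]

def rotation_about_joint(R, joint):
    """Rigid transform (R, t) that rotates by R around a joint position.

    x -> R (x - joint) + joint, i.e. t = joint - R joint.
    """
    rotated_joint = apply_rigid(R, [0.0, 0.0, 0.0], joint)
    t = [joint[i] - rotated_joint[i] for i in range(3)]
    return R, t

def linear_blend_skinning(rest_points, weights, bone_transforms):
    """Standard LBS: posed_i = sum_j w_ij * (R_j p_i + t_j).

    weights[i][j] is the (assumed given) influence of bone j on point i;
    each row is expected to be non-negative and sum to 1.
    """
    posed = []
    for p, w in zip(rest_points, weights):
        q = [0.0, 0.0, 0.0]
        for w_j, (R, t) in zip(w, bone_transforms):
            moved = apply_rigid(R, t, p)
            for i in range(3):
                q[i] += w_j * moved[i]
        posed.append(q)
    return posed

def mean_euclidean_error(points_a, points_b):
    """Mean Euclidean distance between corresponding 3D points."""
    return sum(math.dist(a, b) for a, b in zip(points_a, points_b)) / len(points_a)

# Hypothetical toy example: bone 0 stays fixed, bone 1 rotates 90 degrees
# about a joint located at (1, 0, 0).
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
bent = rotation_about_joint(rotate_z(math.pi / 2), [1.0, 0.0, 0.0])

rest = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # assumed skinning weights

posed = linear_blend_skinning(rest, weights, [identity, bent])
error = mean_euclidean_error(posed, rest)
```

In this toy setup the point at the joint does not move regardless of its weights, while the point attached only to bone 1 swings from (2, 0, 0) to approximately (1, 1, 0). An error value such as the ones quoted in the abstract would be computed by comparing estimated joint positions against reference positions in the same way.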
