Efficient Human Pose Estimation from Single Depth Images

We describe two new approaches to human pose estimation. Both can quickly and accurately predict the 3D positions of body joints from a single depth image without using any temporal information. The key to both approaches is the use of a large, realistic, and highly varied synthetic set of training images. This allows us to learn models that are largely invariant to factors such as pose, body shape, field-of-view cropping, and clothing. Our first approach employs an intermediate body parts representation, designed so that an accurate per-pixel classification of the parts will localize the joints of the body. The second approach instead directly regresses the positions of body joints. By using simple depth pixel comparison features and parallelizable decision forests, both approaches can run super-real time on consumer hardware. Our evaluation investigates many aspects of our methods, and compares the approaches to each other and to the state of the art. Results on silhouettes suggest broader applicability to other imaging modalities.

[1]  Martin A. Fischler,et al.  The Representation and Matching of Pictorial Structures , 1973, IEEE Transactions on Computers.

[2]  B. A. Shepherd,et al.  An Appraisal of a Decision Tree Approach to Image Classification , 1983, IJCAI.

[3]  Teofilo F. GONZALEZ,et al.  Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci..

[4]  Jeffrey Scott Vitter,et al.  Random sampling with a reservoir , 1985, TOMS.

[5]  Yali Amit,et al.  Shape Quantization and Recognition with Randomized Trees , 1997, Neural Computation.

[6]  Jitendra Malik,et al.  Tracking people with twists and exponential maps , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[7]  Dariu Gavrila,et al.  Pedestrian Detection from a Moving Vehicle , 2000, ECCV.

[8]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Michael J. Black,et al.  Implicit Probabilistic Models of Human Motion for Synthesis and Tracking , 2002, ECCV.

[10]  Jitendra Malik,et al.  Estimating Human Body Configurations Using Shape Context Matching , 2002, ECCV.

[11]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[12]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[13]  David A. Forsyth,et al.  Finding and tracking people from the bottom up , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[14]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[15]  David A. Forsyth,et al.  Probabilistic Methods for Finding People , 2001, International Journal of Computer Vision.

[16]  Ankur Agarwal,et al.  3D human pose from silhouettes by relevance vector regression , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[17]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[18]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[19]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[20]  Sidharth Bhatia,et al.  Tracking loose-limbed people , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[21]  Ben Taskar,et al.  Discriminative learning of Markov random fields for segmentation of 3D scan data , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[22]  Vincent Lepetit,et al.  Randomized trees for real-time keypoint recognition , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[23]  Reinhard Koch,et al.  Nonlinear Body Pose Estimation from Depth Images , 2005, DAGM-Symposium.

[24]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[25]  Adrian Hilton,et al.  A survey of advances in vision-based human motion capture and analysis , 2006, Comput. Vis. Image Underst..

[26]  Rüdiger Dillmann,et al.  Sensor fusion for 3D human body tracking with an articulated 3D body model , 2006, Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006..

[27]  Jamie Shotton,et al.  The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[28]  Antonio Criminisi,et al.  TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation , 2006, ECCV.

[29]  Kikuo Fujimura,et al.  Constrained Optimization for Human Pose Estimation from Depth Sequences , 2007, ACCV.

[30]  Bernt Schiele,et al.  Robust Object Detection with Interleaved Categorization and Segmentation , 2008, International Journal of Computer Vision.

[31]  Andrew W. Fitzgibbon,et al.  The Joint Manifold Model for Semi-supervised Multi-valued Regression , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[32]  Cristian Sminchisescu,et al.  Semi-supervised Hierarchical Models for 3D Human Pose Reconstruction , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Ronald Poppe,et al.  Vision-based human motion analysis: An overview , 2007, Comput. Vis. Image Underst..

[34]  Philip H. S. Torr,et al.  Randomized trees for human pose detection , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Toby Sharp,et al.  Implementing Decision Trees and Forests on a GPU , 2008, ECCV.

[36]  Zhuowen Tu,et al.  Auto-context and its application to high-level vision tasks , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Stefano Soatto,et al.  Relevant Feature Selection for Human Pose Estimation and Localization in Cluttered Images , 2008, ECCV.

[38]  Roberto Cipolla,et al.  Semantic texton forests for image categorization and segmentation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Trevor Darrell,et al.  Sparse probabilistic regression for activity-independent human pose inference , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Yihong Gong,et al.  Discriminative learning of visual words for 3D human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Juergen Gall,et al.  Class-specific Hough forests for object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[43]  Robert Y. Wang,et al.  Real-time hand-tracking with a color glove , 2009, ACM Trans. Graph..

[44]  David J. Fleet,et al.  Physics-Based Person Tracking Using the Anthropomorphic Walker , 2010, International Journal of Computer Vision.

[45]  Aaron Hertzmann,et al.  Learning 3D mesh segmentation and labeling , 2010, ACM Trans. Graph..

[46]  Michael Arens,et al.  Human pose estimation with implicit shape models , 2010, ARTEMIS '10.

[47]  Sebastian Thrun,et al.  Real-time identification and localization of body parts from depth images , 2010, 2010 IEEE International Conference on Robotics and Automation.

[48]  Antonio Criminisi,et al.  Regression Forests for Efficient Anatomy Detection and Localization in CT Studies , 2010, MCV.

[49]  Sebastian Thrun,et al.  Real time motion capture using a single time-of-flight camera , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[50]  Gérard G. Medioni,et al.  Human pose estimation from a single view point, real-time range sensor , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[51]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[52]  Dimitris N. Metaxas,et al.  Entangled Decision Forests and Their Application for Semantic Segmentation of CT Images , 2011, IPMI.

[53]  Andrew W. Fitzgibbon,et al.  Efficient regression of general-activity human poses from depth images , 2011, 2011 International Conference on Computer Vision.

[54]  Toby Sharp,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR.

[55]  Sebastian Nowozin,et al.  Improved Information Gain Estimates for Decision Tree Induction , 2012, ICML.

[56]  Antonio Criminisi,et al.  Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning , 2012, Found. Trends Comput. Graph. Vis..