Metric Regression Forests for Human Pose Estimation

Traditionally, human pose estimation algorithms have been classified as either generative [2] or discriminative [4]. Generative approaches model the likelihood of the observations given a pose estimate; however, they are susceptible to local minima and thus require good initial pose estimates. Discriminative approaches learn a direct mapping from image features to pose space from training data; however, they struggle to generalize to unseen poses. Building on previous work [3], Taylor et al. [5] bypass some of these limitations using a hybrid approach that discriminatively predicts, for each pixel in a depth image, a corresponding point on the surface of a humanoid mesh model. This mesh model is then robustly fit to the resulting set of correspondences using local optimization. Surprisingly, though, these correspondences are inferred using a random forest whose structure was trained using a classification objective that arbitrarily equates target model points belonging to the same predefined body part [3].

In this paper, we address Taylor et al.’s use of this proxy classification objective by proposing Metric Space Information Gain (MSIG), a replacement objective function for training a random forest to directly minimize the uncertainty over the target model points, naturally encoding the correlation between these points as a function of the geodesic distance. To this end, we view the surface of the model U as a metric space (U, d_U) defined by the geodesic distance metric d_U (see first panel of Figure 1). In such a space, the natural objective for minimizing the uncertainty in the true distributions that result from a split function s is the information gain I(s) [1]. This is generally approximated using an empirical distribution Q = {u_i} ⊆ U drawn from the true unsplit distribution p_U as

\[
I(s) \approx I(s; Q) = \hat{H}(Q) - \sum_{i \in \{L, R\}} \frac{|Q_i|}{|Q|} \hat{H}(Q_i) \tag{1}
\]
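To make Eq. (1) concrete, the following is a minimal Python sketch of the empirical information gain of a single split. It reads Q_L and Q_R as the subsets of Q that the split s sends to the left and right child, which the excerpt above implies but does not state. The function names, the `dist` matrix and the `bandwidth` parameter are hypothetical, and the two entropy estimators are illustrative stand-ins rather than the estimator used in the paper: `shannon_entropy` corresponds to a proxy classification objective over discrete body-part labels, while `kernel_entropy` is one assumed way of letting a precomputed distance matrix (standing in for d_U) enter the estimate.

```python
import numpy as np


def shannon_entropy(labels):
    """Plug-in Shannon entropy (in nats) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def kernel_entropy(idx, dist, bandwidth=1.0):
    """Resubstitution entropy estimate from a Gaussian kernel on distances.

    idx  : integer indices of the samples in the set
    dist : full precomputed pairwise distance matrix (assumed stand-in for d_U)
    The kernel normaliser is assumed constant and dropped; because the child
    weights in Eq. (1) sum to one, it cancels in the gain.
    """
    d = dist[np.ix_(idx, idx)]
    density = np.exp(-0.5 * (d / bandwidth) ** 2).mean(axis=1)
    return float(-np.mean(np.log(density)))


def information_gain(parent, left, right, entropy):
    """Eq. (1): H(Q) - sum over i in {L, R} of |Q_i| / |Q| * H(Q_i)."""
    n = len(parent)
    gain = entropy(parent)
    for child in (left, right):
        if len(child) > 0:  # an empty child contributes nothing
            gain -= len(child) / n * entropy(child)
    return gain


# Discrete case: four samples with two body-part labels.
labels = np.array([0, 0, 1, 1])
pure_gain = information_gain(labels, labels[:2], labels[2:], shannon_entropy)
mixed_gain = information_gain(labels, labels[[0, 2]], labels[[1, 3]], shannon_entropy)

# Metric case: four points on a line forming two tight clusters.
points = np.array([0.0, 0.1, 5.0, 5.1])
dist = np.abs(points[:, None] - points[None, :])
h = lambda idx: kernel_entropy(idx, dist)
everything = np.arange(4)
good_split = information_gain(everything, np.array([0, 1]), np.array([2, 3]), h)
bad_split = information_gain(everything, np.array([0, 2]), np.array([1, 3]), h)
```

In this toy setup a split that separates the two labels (or the two clusters) scores a higher gain than one that mixes them. The distance-based variant, unlike the label-based one, also distinguishes nearby from distant points within a set, which is the kind of correlation the text attributes to the geodesic distance.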

[1] Cristian Sminchisescu, et al. Twin Gaussian Processes for Structured Prediction, 2010, International Journal of Computer Vision.

[2] Sebastian Thrun, et al. Real time motion capture using a single time-of-flight camera, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[3] Antonio Criminisi, et al. Decision Forests for Computer Vision and Medical Image Analysis, 2013, Advances in Computer Vision and Pattern Recognition.

[4] Hans-Peter Seidel, et al. A data-driven approach for real-time full body pose reconstruction from a depth camera, 2011, 2011 International Conference on Computer Vision.

[5] Andrew W. Fitzgibbon, et al. The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[6] Wray L. Buntine, et al. A further comparison of splitting rules for decision-tree induction, 2004, Machine Learning.

[7] Hans-Peter Seidel, et al. Fast articulated motion tracking using a sums of Gaussians body model, 2011, 2011 International Conference on Computer Vision.

[8] Trevor Darrell, et al. Sparse probabilistic regression for activity-independent human pose inference, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[9] Josef Stoer, et al. Numerische Mathematik 1, 1989.

[10] Cristian Sminchisescu, et al. Feature-Based Pose Estimation, 2011, Visual Analysis of Humans.

[11] Sebastian Nowozin, et al. Improved Information Gain Estimates for Decision Tree Induction, 2012, ICML.

[12] Bodo Rosenhahn, et al. Model-Based Pose Estimation, 2011, Visual Analysis of Humans.

[13] E. Parzen. On Estimation of a Probability Density Function and Mode, 1962.

[14] David J. Fleet, et al. Shared Kernel Information Embedding for discriminative inference, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15] Andrew W. Fitzgibbon, et al. Efficient regression of general-activity human poses from depth images, 2011, 2011 International Conference on Computer Vision.

[16] Hans-Peter Seidel, et al. Outdoor human motion capture using inverse kinematics and von Mises-Fisher sampling, 2011, 2011 International Conference on Computer Vision.

[17] Bodo Rosenhahn, et al. Efficient and Robust Shape Matching for Model Based Human Motion Capture, 2011, DAGM-Symposium.

[18] Ahmed M. Elgammal, et al. Coupled Visual and Kinematic Manifold Models for Tracking, 2010, International Journal of Computer Vision.

[19] P. J. Green, et al. Density Estimation for Statistics and Data Analysis, 1987.

[20] Luc Van Gool, et al. Hough Forests for Object Detection, Tracking, and Action Recognition, 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21] Edsger W. Dijkstra, et al. A note on two problems in connexion with graphs, 1959, Numerische Mathematik.

[22] Ian D. Reid, et al. Articulated Body Motion Capture by Stochastic Search, 2005, International Journal of Computer Vision.

[23] Andrew W. Fitzgibbon, et al. Real-time human pose recognition in parts from single depth images, 2011, CVPR 2011.

[24] Andrew W. Fitzgibbon, et al. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[25] Jon Louis Bentley, et al. Multidimensional binary search trees used for associative searching, 1975, CACM.

[26] Hans-Peter Seidel, et al. Optimization and Filtering for Human Motion Capture, 2010, International Journal of Computer Vision.

[27] Wei Zhong Liu, et al. The Importance of Attribute Selection Measures in Decision Tree Induction, 1994, Machine Learning.

[28] Sebastian Thrun, et al. Real-Time Human Pose Tracking from Range Data, 2012, ECCV.

[29] David J. Fleet, et al. Physics-Based Person Tracking Using the Anthropomorphic Walker, 2010, International Journal of Computer Vision.

[30] Jitendra Malik, et al. Twist Based Acquisition and Tracking of Animal and Human Kinematics, 2004, International Journal of Computer Vision.