Entropy Driven Hierarchical Search for 3D Human Pose Estimation

Estimating 3D human pose from a single monocular image is an extremely difficult problem. There are currently two main approaches: the first learns a direct mapping from image features to 3D pose [1]; the second first extracts a 2D pose as an intermediate stage and then ‘lifts’ it to a 3D pose [2]. The limitation of both approaches is that they are only applicable to poses similar to those represented in the original training set, e.g. walking. It is unlikely that they will scale to extract arbitrary 3D poses. In contrast, current state-of-the-art methods for 2D pose estimation have been shown capable of detecting much more varied poses [3]. This has been achieved using generative models built around the Pictorial Structures representation, which decomposes pose estimation into a search across individual parts [4].

In this paper we present a generative method to extract 3D pose from single images using a part-based representation. The method is stochastic. However, in methods used for 3D tracking (e.g. the particle filter) the search space in each frame is tightly constrained by previous observations, whereas in single-image pose estimation the search space is much larger. To permit a search over this space, a generative prior model is learnt from motion capture data. Stochastic samples are used to approximate this prior and to facilitate its update; in effect, the initial prior is iteratively deformed into the posterior distribution.

The body is represented by a set of ten parts; each part has a fixed length, and connected parts are forced to join at fixed locations. The conditional distribution between two connected parts is modeled by first learning a joint distribution using a Gaussian mixture model (GMM), $p(x_i, x_j \mid \theta_{ij})$, where $x_i$ and $x_j$ are the states of the $i$th and $j$th parts respectively and $\theta_{ij}$ is the set of model parameters.
As each model is represented using a GMM, the model parameters are defined as $\theta_{ij} = \{\lambda^k_{ij}, \mu^k_{ij}, \Sigma^k_{ij}\}_{k=1}^{K}$, where $K$ is the number of components in the model and $\lambda^k_{ij}$, $\mu^k_{ij}$, $\Sigma^k_{ij}$ are the $k$th component's weight, mean and covariance respectively. For efficiency, all covariances used to represent limb conditionals are diagonal and can be partitioned such that $\Sigma^k_{ij} = \mathrm{diag}(\Lambda^k_{ii}, \Lambda^k_{jj})$, and likewise $\mu^k_{ij} = (\mu^k_i, \mu^k_j)$. Given a value for $x_j$ (e.g. a sample), the conditional distribution $p(x_i \mid x_j, \theta_{ij})$ is first calculated from the joint distribution $p(x_i, x_j \mid \theta_{ij})$, after which a sample $x_i$ can be drawn from it. The conditional distribution is also a GMM, with parameters $\{\lambda^k_i, \mu^k_i, \Lambda^k_{ii}\}_{k=1}^{K}$. The component weights are proportional to the marginal distribution, $\lambda^k_i \propto p(x_j \mid \theta^k_{ij})$, which is calculated from the normal distribution $p(x_j \mid \theta^k_{ij}) = \lambda^k_{ij}\,\mathcal{N}(x_j; \mu^k_j, \Lambda^k_{jj})$, where $\theta^k_{ij}$ denotes the parameters of the $k$th component. Note that this conditional model differs from the approximation typically used, in which the conditional is approximated by $p(x_{ij} \mid \theta_{ij})$, where $x_{ij}$ is the value of $x_i$ in the local frame of reference of $x_j$ [3]. A benefit of learning a full conditional model between neighboring parts is that different GMM components learnt in quaternion space correspond to different spatial locations in $\mathbb{R}^3$. This is illustrated in Fig. 1, where the representation can be seen to capture multiple modes.
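
The conditional sampling step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy parameter values, and the treatment of part states as plain real-valued vectors (ignoring any quaternion normalisation) are all hypothetical. It relies on the partitioned covariance $\mathrm{diag}(\Lambda^k_{ii}, \Lambda^k_{jj})$, under which conditioning on $x_j$ changes only the component weights and leaves each component's mean and covariance for $x_i$ unchanged.

```python
import math
import random


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance; var holds the diagonal."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v
        for xd, m, v in zip(x, mean, var)
    )


def conditional_weights(x_j, weights, mu_j, var_j):
    """lambda_i^k proportional to lambda_ij^k * N(x_j; mu_j^k, Lambda_jj^k), normalised over k."""
    log_w = [
        math.log(w) + diag_gaussian_logpdf(x_j, m, v)
        for w, m, v in zip(weights, mu_j, var_j)
    ]
    top = max(log_w)  # subtract the maximum before exponentiating, for numerical stability
    unnorm = [math.exp(lw - top) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sample_conditional(x_j, weights, mu_i, var_i, mu_j, var_j, rng):
    """Draw one x_i from p(x_i | x_j) for a joint GMM with partitioned diagonal covariances."""
    w = conditional_weights(x_j, weights, mu_j, var_j)
    k = rng.choices(range(len(w)), weights=w)[0]  # choose a mixture component
    # Partitioned covariance: the component's x_i mean and variance do not depend on x_j.
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu_i[k], var_i[k])]


# Toy two-component model with one-dimensional part states (values are made up).
weights = [0.5, 0.5]
mu_j, var_j = [[-1.0], [1.0]], [[0.1], [0.1]]
mu_i, var_i = [[0.0], [5.0]], [[0.01], [0.01]]

rng = random.Random(0)
w = conditional_weights([1.0], weights, mu_j, var_j)
x_i = sample_conditional([1.0], weights, mu_i, var_i, mu_j, var_j, rng)
print(w, x_i)
```

In this toy setting the observed $x_j$ lies at the mean of the second component, so almost all of the conditional weight falls on that component and the sampled $x_i$ lands near its mean. Working in log space before normalising is a design choice made here to avoid underflow when $x_j$ is far from every component.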

[1]  Daniel P. Huttenlocher,et al.  Beyond trees: common-factor models for 2D human pose recovery , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[2]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[3]  Ankur Agarwal,et al.  Recovering 3D Human Pose from Monocular Images , 2006 .

[4]  Stefano Soatto,et al.  Relevant Feature Selection for Human Pose Estimation and Localization in Cluttered Images , 2008, ECCV.

[5]  Daniel P. Huttenlocher,et al.  A unified spatio-temporal articulated model for tracking , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004.

[6]  Ian D. Reid,et al.  Articulated Body Motion Capture by Stochastic Search , 2005, International Journal of Computer Vision.

[7]  Andrew Zisserman,et al.  Progressive search space reduction for human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[10]  David A. Forsyth,et al.  Finding and tracking people from the bottom up , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings.

[11]  Michael Isard,et al.  Partitioned Sampling, Articulated Objects, and Interface-Quality Hand Tracking , 2000, ECCV.

[12]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[13]  John R. Hershey,et al.  Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[14]  Bernt Schiele,et al.  Pictorial structures revisited: People detection and articulated pose estimation , 2009, CVPR.

[15]  Sidharth Bhatia,et al.  Tracking loose-limbed people , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004.

[16]  Jianbo Shi,et al.  Multiple frame motion inference using belief propagation , 2004, Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings.

[17]  Mun Wai Lee,et al.  A model-based approach for estimating human 3D poses in static images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[19]  Xianghua Xie,et al.  Estimating 3D Pose via Stochastic Search and Expectation Maximization , 2010, AMDO.

[20]  Martin A. Fischler,et al.  The Representation and Matching of Pictorial Structures , 1973, IEEE Transactions on Computers.

[21]  G. Hua,et al.  Variational maximum a posteriori by annealed mean field analysis , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[23]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.