A Two-Stage Bayesian Network Method for 3D Human Pose Estimation from Monocular Image Sequences

This paper proposes a novel human motion capture method that locates human body joint position and reconstructs the human pose in 3D space from monocular images. We propose a two-stage framework including 2D and 3D probabilistic graphical models which can solve the occlusion problem for the estimation of human joint positions. The 2D and 3D models adopt directed acyclic structure to avoid error propagation of inference. Image observations corresponding to shape and appearance features of humans are considered as evidence for the inference of 2D joint positions in the 2D model. Both the 2D and 3D models utilize the Expectation Maximization algorithm to learn prior distributions of the models. An annealed Gibbs sampling method is proposed for the two-stage method to inference the maximum posteriori distributions of joint positions. The annealing process can efficiently explore the mode of distributions and find solutions in high-dimensional space. Experiments are conducted on the HumanEva dataset with image sequences of walking motion, which has challenges of occlusion and loss of image observations. Experimental results show that the proposed two-stage approach can efficiently estimate more accurate human poses.

[1]  Nebojsa Jojic,et al.  Tracking articulated self - occluding objects in dense disparity maps , 1999 .

[2]  K. Rohr Towards model-based recognition of human movements in image sequences , 1994 .

[3]  A. Elgammal,et al.  Inferring 3D body pose from silhouettes using activity manifold learning , 2004, CVPR 2004.

[4]  Rómer Rosales,et al.  Inferring body pose without tracking body parts , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[5]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Camillo J. Taylor,et al.  Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image , 2000, Comput. Vis. Image Underst..

[7]  Eric V. Slud Graphical Models: Methods for Data Analysis and Mining , 2003, Technometrics.

[8]  Mansoor Sarhadi,et al.  Non-linear statistical models for the 3D reconstruction of human pose and motion from monocular image sequences , 2000, Image Vis. Comput..

[9]  Matthew Brand,et al.  Shadow puppetry , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[10]  Donald Geman,et al.  Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Stefan Carlsson,et al.  Monocular 3D Reconstruction of Human Motion in Long Action Sequences , 2004, ECCV.

[12]  Ronald Poppe,et al.  Vision-based human motion analysis: An overview , 2007, Comput. Vis. Image Underst..

[13]  Andrew Zisserman,et al.  Tracking People by Learning Their Appearance , 2007 .

[14]  Sidharth Bhatia,et al.  Tracking loose-limbed people , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[15]  Mun Wai Lee,et al.  A model-based approach for estimating human 3D poses in static images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Alfredo Gardel Vicente,et al.  Unsupervised and adaptive Gaussian skin-color model , 2000, Image Vis. Comput..

[17]  Brendan J. Frey,et al.  Learning flexible sprites in video layers , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[18]  Nicolas Thome,et al.  Human Body Part Labeling and Tracking Using Graph Matching Theory , 2006, 2006 IEEE International Conference on Video and Signal Based Surveillance.

[19]  Chung-Lin Huang,et al.  Human Motion Parameter Capturing Using Particle Filter and Nonparametric Belief Propagation , 2008, 2008 IEEE Southwest Symposium on Image Analysis and Interpretation.

[20]  Tony Lindeberg,et al.  Feature Detection with Automatic Scale Selection , 1998, International Journal of Computer Vision.

[21]  G. Johansson Visual perception of biological motion and a model for its analysis , 1973 .

[22]  Michael I. Jordan Learning in Graphical Models , 1999, NATO ASI Series.

[23]  Michael J. Black,et al.  A Quantitative Evaluation of Video-based 3D Person Tracking , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[24]  Nando de Freitas,et al.  An Introduction to MCMC for Machine Learning , 2004, Machine Learning.

[25]  Gang Hua,et al.  Learning to estimate human pose with data driven belief propagation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[26]  David A. Forsyth,et al.  Tracking People by Learning Their Appearance , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion , 2006 .

[28]  T. Moon The expectation-maximization algorithm , 1996, IEEE Signal Process. Mag..

[29]  Ankur Agarwal,et al.  A Local Basis Representation for Estimating Human Pose from Cluttered Images , 2006, ACCV.

[30]  Shaogang Gong,et al.  A Dynamic 3D Human Model using Hybrid 2D-3D Representations in Hierarchical PCA Space , 1999, BMVC.

[31]  Trevor Darrell,et al.  Bayesian Articulated Tracking Using Single Frame Pose Sampling , 2003 .