Recovering 3D human pose from monocular images

We describe a learning-based method for recovering 3D human body pose from single images and monocular image sequences. Our approach requires neither an explicit body model nor prior labeling of body parts in the image. Instead, it recovers pose by direct nonlinear regression against shape descriptor vectors extracted automatically from image silhouettes. For robustness against local silhouette segmentation errors, silhouette shape is encoded by histogram-of-shape-contexts descriptors. We evaluate several different regression methods: ridge regression, relevance vector machine (RVM) regression, and support vector machine (SVM) regression over both linear and kernel bases. The RVMs provide much sparser regressors without compromising performance, and kernel bases give a small but worthwhile improvement in performance. The loss of depth and limb labeling information often makes the recovery of 3D pose from single silhouettes ambiguous. To handle this, the method is embedded in a novel regressive tracking framework, using dynamics from the previous state estimate together with a learned regression value to disambiguate the pose. We show that the resulting system tracks long sequences stably. For realism and good generalization over a wide range of viewpoints, we train the regressors on images resynthesized from real human motion capture data. The method is demonstrated for several representations of full body pose, both quantitatively on independent but similar test data and qualitatively on real image sequences. Mean angular errors of 4-6/spl deg/ are obtained for a variety of walking motions.

[1]  Geoffrey E. Hinton,et al.  Adaptive Mixtures of Local Experts , 1991, Neural Computation.

[2]  Vladimir Vapnik,et al.  The Nature of Statistical Learning , 1995 .

[3]  Christopher M. Bishop,et al.  Neural networks for pattern recognition , 1995 .

[4]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[5]  Jitendra Malik,et al.  Tracking people with twists and exponential maps , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[6]  Leonidas J. Guibas,et al.  A metric for distributions with applications to image databases , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[7]  Matthew Brand,et al.  Shadow puppetry , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[8]  William T. Freeman,et al.  Bayesian Reconstruction of 3D Human Motion from Single-Camera Video , 1999, NIPS.

[9]  B. Schölkopf,et al.  Advances in kernel methods: support vector learning , 1999 .

[10]  David J. C. MacKay,et al.  Comparison of Approximate Methods for Handling Hyperparameters , 1999, Neural Computation.

[11]  Michael E. Tipping The Relevance Vector Machine , 1999, NIPS.

[12]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[13]  Michael J. Black,et al.  Learning and Tracking Cyclic Human Motion , 2000, NIPS.

[14]  Camillo J. Taylor,et al.  Reconstruction of articulated objects from point correspondences in a single uncalibrated image , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[15]  Camillo J. Taylor,et al.  Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image , 2000, Comput. Vis. Image Underst..

[16]  Rómer Rosales,et al.  Inferring body pose without tracking body parts , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[17]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[18]  Vladimir Pavlovic,et al.  Learning Switching Linear Models of Human Motion , 2000, NIPS.

[19]  Stefan Schaal,et al.  Learning inverse kinematics , 2001, Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No.01CH37180).

[20]  Cristian Sminchisescu,et al.  Covariance scaled sampling for monocular 3D body tracking , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[21]  Andrew Blake,et al.  Probabilistic tracking in a metric space , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[22]  George Eastman House,et al.  Sparse Bayesian Learning and the Relevance Vector Machine , 2001 .

[23]  Michael J. Black,et al.  Implicit Probabilistic Models of Human Motion for Synthesis and Tracking , 2002, ECCV.

[24]  Jitendra Malik,et al.  Estimating Human Body Configurations Using Shape Context Matching , 2002, ECCV.

[25]  Michel Dhome,et al.  Hyperplane Approximation for Template Matching , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[26]  Stan Sclaroff,et al.  Estimating 3D hand pose from a cluttered image , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[27]  Cristian Sminchisescu,et al.  Kinematic jump processes for monocular 3D human tracking , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[28]  Andrew Blake,et al.  A sparse probabilistic learning algorithm for real-time tracking , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[29]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[30]  Björn Stenger,et al.  Filtering using a tree-based estimator , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[31]  Trevor Darrell,et al.  Inferring 3D structure with a statistical image-based shape model , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[32]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[33]  Ankur Agarwal,et al.  3D human pose from silhouettes by relevance vector regression , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[34]  Ankur Agarwal,et al.  Tracking Articulated Motion Using a Mixture of Autoregressive Models , 2004, ECCV.

[35]  Ankur Agarwal,et al.  Learning to track 3D human motion from silhouettes , 2004, ICML.

[36]  Michael Isard,et al.  CONDENSATION—Conditional Density Propagation for Visual Tracking , 1998, International Journal of Computer Vision.

[37]  B. Triggs,et al.  Tracking Articulated Motion with Piecewise Learned Dynamical Models , 2004 .