Context-Based Appearance Descriptor for 3D Human Pose Estimation from Monocular Images

In this paper we propose a novel appearance descriptor for 3D human pose estimation from monocular images using a learning-based technique. Our image-descriptor is based on the intermediate local appearance descriptors that we design to encapsulate local appearance context and to be resilient to noise. We encode the image by the histogram of such local appearance context descriptors computed in an image to obtain the final image-descriptor for pose estimation. We name the final image-descriptor the Histogram of Local Appearance Context (HLAC). We then use Relevance Vector Machine (RVM) regression to learn the direct mapping between the proposed HLAC image-descriptor space and the 3D pose space. Given a test image, we first compute the HLAC descriptor and then input it to the trained regressor to obtain the final output pose in real time. We compared our approach with other methods using a synchronized video and 3D motion dataset. We compared our proposed HLAC image-descriptor with the Histogram of Shape Context and Histogram of SIFT like descriptors. The evaluation results show that HLAC descriptor outperforms both of them in the context of 3D Human pose estimation.

[1]  Ian D. Reid,et al.  Articulated Body Motion Capture by Stochastic Search , 2005, International Journal of Computer Vision.

[2]  Stefano Soatto,et al.  Fast Human Pose Estimation using Appearance and Motion via Multi-Dimensional Boosting Regression , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion , 2006 .

[4]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Michael E. Tipping The Relevance Vector Machine , 1999, NIPS.

[6]  Ankur Agarwal,et al.  A Local Basis Representation for Estimating Human Pose from Cluttered Images , 2006, ACCV.

[7]  L. Davis,et al.  Background and foreground modeling using nonparametric kernel density estimation for visual surveillance , 2002, Proc. IEEE.

[8]  Yihong Gong,et al.  Discriminative learning of visual words for 3D human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Rómer Rosales,et al.  Learning Body Pose via Specialized Maps , 2001, NIPS.

[10]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[11]  Cordelia Schmid,et al.  A performance evaluation of local descriptors , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[13]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.