Progressive search space reduction for human pose estimation

The objective of this paper is to estimate 2D human pose as a spatial configuration of body parts in TV and movie video shots. Such video material is uncontrolled and extremely challenging. We propose an approach that progressively reduces the search space for body parts, to greatly improve the chances that pose estimation will succeed. This involves two contributions: (i) a generic detector using a weak model of pose to substantially reduce the full pose search space; and (ii) employing 'grabcut' initialized on detected regions proposed by the weak model, to further prune the search space. Moreover, we also propose (Hi) an integrated spatio- temporal model covering multiple frames to refine pose estimates from individual frames, with inference using belief propagation. The method is fully automatic and self-initializing, and explains the spatio-temporal volume covered by a person moving in a shot, by soft-labeling every pixel as belonging to a particular body part or to the background. We demonstrate upper-body pose estimation by an extensive evaluation over 70000 frames from four episodes of the TV series Buffy the vampire slayer, and present an application to full- body action recognition on the Weizmann dataset.

[1]  Luc Van Gool,et al.  Real-time affine region tracking and coplanar grouping , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[2]  Cristian Sminchisescu,et al.  Estimating Articulated Human Motion with Covariance Scaled Sampling , 2003, Int. J. Robotics Res..

[3]  Ankur Agarwal,et al.  3D human pose from silhouettes by relevance vector regression , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[4]  Andrew Zisserman,et al.  Learning Layered Pictorial Structures from Video , 2004, ICVGIP.

[5]  Ankur Agarwal,et al.  Tracking Articulated Motion Using a Mixture of Autoregressive Models , 2004, ECCV.

[6]  Michael Isard,et al.  Tracking loose-limbed people , 2004, CVPR 2004.

[7]  Alexei A. Efros,et al.  Recovering human body configurations: combining segmentation and recognition , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[8]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[9]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[10]  David A. Forsyth,et al.  Strike a pose: tracking people by finding stylized poses , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[12]  Andrew Zisserman,et al.  Person Spotting: Video Shot Retrieval for Face Sets , 2005, CIVR.

[13]  Andrew Zisserman,et al.  Learning Layered Motion Segmentations of Video , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[14]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15]  Nebojsa Jojic,et al.  Escaping local minima through hierarchical model selection: Automatic object discovery, segmentation, and tracking in video , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[16]  Michael J. Black,et al.  Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[17]  Vincent Lepetit,et al.  Feature Harvesting for Tracking-by-Detection , 2006, ECCV.

[18]  Pushmeet Kohli,et al.  PoseCut: Simultaneous Segmentation and 3D Pose Estimation of Humans Using Dynamic Graph-Cuts , 2006, ECCV.

[19]  Deva Ramanan,et al.  Learning to parse images of articulated bodies , 2006, NIPS.

[20]  Antonio Criminisi,et al.  Single-Histogram Class Models for Image Segmentation , 2006, ICVGIP.

[21]  Ivan Laptev,et al.  Improvements of Object Detection Using Boosted Histograms , 2006, BMVC.

[22]  Eli Shechtman,et al.  Matching Local Self-Similarities across Images and Videos , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Pinar Duygulu Sahin,et al.  Human Action Recognition Using Distribution of Oriented Rectangular Patches , 2007, Workshop on Human Motion.

[24]  Juan Carlos Niebles,et al.  A Hierarchical Model of Shape and Appearance for Human Action Classification , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[26]  Larry S. Davis,et al.  An Interactive Approach to Pose-Assisted and Appearance-based Segmentation of Humans , 2007, 2007 IEEE 11th International Conference on Computer Vision.