Parsing human motion with stretchable models

We address the problem of articulated human pose estimation in videos using an ensemble of tractable models with rich appearance, shape, contour and motion cues. In previous articulated pose estimation work on unconstrained videos, using temporal coupling of limb positions has made little to no difference in performance over parsing frames individually [8, 28]. One crucial reason for this is that joint parsing of multiple articulated parts over time involves intractable inference and learning problems, and previous work has resorted to approximate inference and simplified models. We overcome these computational and modeling limitations using an ensemble of tractable submodels which couple locations of body joints within and across frames using expressive cues. Each submodel is responsible for tracking a single joint through time (e.g., left elbow) and also models the spatial arrangement of all joints in a single frame. Because of the tree structure of each submodel, we can perform efficient exact inference and use rich temporal features that depend on image appearance, e.g., color tracking and optical flow contours. We propose and experimentally investigate a hierarchy of submodel combination methods, and we find that a highly efficient max-marginal combination method outperforms much slower (by orders of magnitude) approximate inference using dual decomposition. We apply our pose model on a new video dataset of highly varied and articulated poses from TV shows. We show significant quantitative and qualitative improvements over state-of-the-art single-frame pose estimation approaches.

[1]  Andrew Zisserman,et al.  Long Term Arm and Hand Tracking for Continuous Sign Language TV Broadcasts , 2008, BMVC.

[2]  Cordelia Schmid,et al.  Human Detection Based on a Probabilistic Assembly of Robust Part Detectors , 2004, ECCV.

[3]  David A. Forsyth,et al.  Strike a pose: tracking people by finding stylized poses , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[4]  David A. Forsyth,et al.  Improved Human Parsing with a Full Relational Model , 2010, ECCV.

[5]  Vittorio Ferrari,et al.  Better Appearance Models for Pictorial Structures , 2009, BMVC.

[6]  Ben Taskar,et al.  Sidestepping Intractable Inference with Structured Ensemble Cascades , 2010, NIPS.

[7]  Michael Isard,et al.  CONDENSATION—Conditional Density Propagation for Visual Tracking , 1998, International Journal of Computer Vision.

[8]  Cordelia Schmid,et al.  Learning to Parse Pictures of People , 2002, ECCV.

[9]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[10]  Deva Ramanan,et al.  Learning to parse images of articulated bodies , 2006, NIPS.

[11]  Andrew Zisserman,et al.  Progressive search space reduction for human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Michael Isard,et al.  Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation , 2003, NIPS.

[13]  Ramakant Nevatia,et al.  Human Pose Tracking Using Multi-level Structured Models , 2006, ECCV.

[14]  Michael J. Black,et al.  Cardboard people: a parameterized model of articulated image motion , 1996, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition.

[15]  Jitendra Malik,et al.  Tracking people with twists and exponential maps , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[16]  Philip H. S. Torr,et al.  Randomized trees for human pose detection , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Ben Taskar,et al.  Cascaded Models for Articulated Pose Estimation , 2010, ECCV.

[18]  Ben Taskar,et al.  Structured Prediction Cascades , 2010, AISTATS.

[19]  Jitendra Malik,et al.  Tracking as Repeated Figure/Ground Segmentation , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Trevor Darrell,et al.  Sparse probabilistic regression for activity-independent human pose inference , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Sidharth Bhatia,et al.  Tracking loose-limbed people , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[22]  Bernt Schiele,et al.  Pictorial structures revisited: People detection and articulated pose estimation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Nikos Komodakis,et al.  MRF Optimization via Dual Decomposition: Message-Passing Revisited , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[24]  Ce Liu,et al.  Exploring new representations and applications for motion analysis , 2009 .

[25]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[26]  Cristian Sminchisescu,et al.  Estimating Articulated Human Motion with Covariance Scaled Sampling , 2003, Int. J. Robotics Res..

[27]  Daniel P. Huttenlocher,et al.  Beyond trees: common-factor models for 2D human pose recovery , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[28]  Michael J. Black,et al.  An Adaptive Appearance Model Approach for Model-based Articulated Object Tracking , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[29]  Dimitri P. Bertsekas,et al.  Nonlinear Programming , 1997 .