Expressive Models and Comprehensive Benchmark for 2D Human Pose Estimation

In this work we consider the challenging task of articulated human pose estimation in monocular images. Most current methods in this area [4, 8, 16, 14] are based on the pictorial structures (PS) model and are composed of unary terms modeling body part appearance and pairwise terms between adjacent body parts and/or joints that capture their preferred spatial arrangement. We advance the state of the art in articulated human pose estimation in three ways. First, we argue that modeling dependencies between non-adjacent body parts is important for effective pose estimation (cf. Fig. 1). We propose a model [10] that incorporates higher-order information between body parts by defining a conditional model in which all parts are a priori connected, but which becomes a tractable PS model once the mid-level features are observed. This allows us to model dependencies between non-adjacent parts effectively while retaining the exact and efficient inference procedure of a tree-based model. Second, we explore various types of appearance representations with the aim of improving the body part hypotheses [11]. We argue that obtaining effective part detectors requires leveraging both the pose-specific appearance of body parts and the joint appearance of part constellations. We show that the proposed appearance representations are complementary, and that combining the best-performing appearance model with a flexible image-conditioned spatial model achieves the best result. Third, we introduce a novel benchmark, “MPII Human Pose” [3], that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 800 human activities. The collected images cover a wider variety of human activities than previous datasets, including various recreational, occupational, and household activities, and people are captured from a wider range of viewpoints. In addition, we provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, and activity labels. With these annotations we perform a detailed analysis [3, 12] of the leading 2D human pose estimation and activity recognition methods to understand success and failure cases for established models.
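To make the structure of a tree-based PS model concrete, the following is a minimal sketch of exact MAP inference by max-sum dynamic programming over a tree of parts, combining per-part unary scores with parent-child pairwise scores. It is an illustration under stated assumptions, not the implementation used in this work: the function name `ps_map_inference`, the discrete candidate states per part, and the toy `pairwise` score are all hypothetical. In an image-conditioned model, the pairwise function would presumably be selected from the observed mid-level features; here it is a fixed callable.

```python
from typing import Callable, Dict, List, Tuple


def ps_map_inference(
    unary: Dict[str, List[float]],
    children: Dict[str, List[str]],
    pairwise: Callable[[str, int, str, int], float],
    root: str,
) -> Tuple[float, Dict[str, int]]:
    """Exact MAP inference on a tree of parts via max-sum dynamic programming.

    unary[p][s]          : appearance score of part p in candidate state s
    children[p]          : parts attached to p in the tree (hypothetical layout)
    pairwise(p, s, c, t) : spatial score of child c in state t given parent p in state s
    """
    # Remembers, for each (parent, child, parent_state), the best child state.
    best_child_state: Dict[Tuple[str, str, int], int] = {}

    def upward(part: str) -> List[float]:
        # Start from the part's own unary scores, then absorb each subtree.
        scores = list(unary[part])
        for child in children.get(part, []):
            child_scores = upward(child)
            for s in range(len(scores)):
                best_t = max(
                    range(len(child_scores)),
                    key=lambda t: child_scores[t] + pairwise(part, s, child, t),
                )
                best_child_state[(part, child, s)] = best_t
                scores[s] += child_scores[best_t] + pairwise(part, s, child, best_t)
        return scores

    root_scores = upward(root)
    root_state = max(range(len(root_scores)), key=root_scores.__getitem__)

    # Backtrack from the root to recover the full part configuration.
    assignment = {root: root_state}
    stack = [root]
    while stack:
        part = stack.pop()
        for child in children.get(part, []):
            assignment[child] = best_child_state[(part, child, assignment[part])]
            stack.append(child)
    return root_scores[root_state], assignment


# Toy example: three parts, two candidate states each (all values made up).
toy_unary = {"torso": [1.0, 0.5], "head": [0.2, 0.9], "arm": [0.4, 0.3]}
toy_children = {"torso": ["head", "arm"]}


def toy_pairwise(parent: str, s: int, child: str, t: int) -> float:
    # Hypothetical spatial term: penalize parent and child choosing different states.
    return -float(abs(s - t))


best_score, best_pose = ps_map_inference(toy_unary, toy_children, toy_pairwise, "torso")
```

Because every part has a single parent, each pairwise term is maximized once per parent state, so the cost grows linearly with the number of parts and quadratically with the number of candidate states per part. Connecting non-adjacent parts directly would break this tree structure, which is why the conditional formulation described above is useful.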

[1] David A. McAllester, et al. Object Detection with Discriminatively Trained Part Based Models, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Mark Everingham, et al. Learning effective human pose estimation from inaccurate annotation, 2011, CVPR 2011.

[3] Bernt Schiele, et al. Discriminative Appearance Models for Pictorial Structures, 2011, International Journal of Computer Vision.

[4] Jitendra Malik, et al. Poselets: Body part detectors trained using 3D human pose annotations, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[5] Yi Yang, et al. Articulated Human Detection with Flexible Mixtures of Parts, 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Bernt Schiele, et al. Fine-Grained Activity Recognition with Holistic and Pose Based Features, 2014, GCPR.

[7] Peter V. Gehler, et al. Strong Appearance and Expressive Spatial Models for Human Pose Estimation, 2013, 2013 IEEE International Conference on Computer Vision.

[8] David R. Bassett, et al. 2011 Compendium of Physical Activities: a second update of codes and MET values, 2011, Medicine and Science in Sports and Exercise.

[9] Yang Wang, et al. Learning hierarchical poselets for human parsing, 2011, CVPR 2011.

[10] Ben Taskar, et al. MODEC: Multimodal Decomposable Models for Human Pose Estimation, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[11] Bernt Schiele, et al. Pictorial structures revisited: People detection and articulated pose estimation, 2009, CVPR.

[12] Bernt Schiele, et al. Articulated people detection and pose estimation: Reshaping the future, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13] Jitendra Malik, et al. Articulated Pose Estimation Using Discriminative Armlet Classifiers, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[14] Peter V. Gehler, et al. Poselet Conditioned Pictorial Structures, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15] Bernt Schiele, et al. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.