Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

Deep ConvNets have been shown to be effective for the task of human pose estimation from single images. However, several challenging issues arise in the video-based case such as self-occlusion, motion blur, and uncommon poses with few or no examples in the training data. Temporal information can provide additional cues about the location of body joints and help to alleviate these issues. In this paper, we propose a deep structured model to estimate a sequence of human poses in unconstrained videos. This model can be efficiently trained in an end-to-end manner and is capable of representing the appearance of body joints and their spatio-temporal relationships simultaneously. Domain knowledge about the human body is explicitly incorporated into the network providing effective priors to regularize the skeletal structure and to enforce temporal consistency. The proposed end-to-end architecture is evaluated on two widely used benchmarks for video-based pose estimation (Penn Action and JHMDB datasets). Our approach outperforms several state-of-the-art methods.

[1]  Jitendra Malik,et al.  Recovering human body configurations using pairwise constraints between parts , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[2]  Bernt Schiele,et al.  Pictorial structures revisited: People detection and articulated pose estimation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Peter V. Gehler,et al.  Poselet Conditioned Pictorial Structures , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[5]  Silvio Savarese,et al.  Articulated part-based model for joint object detection and pose estimation , 2011, 2011 International Conference on Computer Vision.

[6]  Mubarak Shah,et al.  Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[8]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[10]  Cristian Sminchisescu,et al.  Estimating Articulated Human Motion with Covariance Scaled Sampling , 2003, Int. J. Robotics Res..

[11]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[13]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Deva Ramanan,et al.  Learning to parse images of articulated bodies , 2006, NIPS.

[15]  David A. Forsyth,et al.  Strike a pose: tracking people by finding stylized poses , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[16]  Daniel P. Huttenlocher,et al.  Distance Transforms of Sampled Functions , 2012, Theory Comput..

[17]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[18]  Cordelia Schmid,et al.  Mixing Body-Part Sequences for Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Navdeep Jaitly,et al.  Chained Predictions Using Convolutional Neural Networks , 2016, ECCV.

[20]  Juergen Gall,et al.  Pose for Action - Action for Pose , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[21]  Michael J. Black,et al.  Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[22]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[23]  Xiaogang Wang,et al.  End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jonathan Tompson,et al.  MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation , 2014, ACCV.

[25]  Katerina Fragkiadaki,et al.  Pose from Flow and Flow from Pose , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[29]  Ben Taskar,et al.  Parsing human motion with stretchable models , 2011, CVPR 2011.

[30]  Deva Ramanan,et al.  N-best maximal decoders for part models , 2011, 2011 International Conference on Computer Vision.

[31]  Jitendra Malik,et al.  Deformable part models are convolutional neural networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Song-Chun Zhu,et al.  Joint action recognition and pose estimation from video , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Andrew Zisserman,et al.  Progressive search space reduction for human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Brendan J. Frey,et al.  A Revolution: Belief Propagation in Graphs with Cycles , 1997, NIPS.

[35]  Ramakant Nevatia,et al.  Human Pose Tracking in Monocular Sequence Using Multilevel Structured Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Weiyu Zhang,et al.  From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[37]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[38]  Xiaogang Wang,et al.  Multi-source Deep Learning for Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Xiaogang Wang,et al.  Structured Feature Learning for Pose Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[41]  Luc Van Gool,et al.  Human Pose Estimation Using Body Parts Dependent Joint Regressors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Cordelia Schmid,et al.  Estimating Human Pose with Flowing Puppets , 2013, 2013 IEEE International Conference on Computer Vision.