Learning Temporal Pose Estimation from Sparsely-Labeled Videos

Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pair of video frames---a labeled Frame A and an unlabeled Frame B---we train our model to predict human pose in Frame A using the features from Frame B by means of deformable convolutions to implicitly learn the pose warping between A and B. We demonstrate that we can leverage our trained PoseWarper for several applications. First, at inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames. This makes it possible to generate pose annotations for the entire video given only a few manually-labeled frames. Compared to modern label propagation methods based on optical flow, our warping mechanism is much more compact (6M vs 39M parameters), and also more accurate (88.7% mAP vs 83.8% mAP). We also show that we can improve the accuracy of a pose estimator by training it on an augmented dataset obtained by adding our propagated poses to the original manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows our system to achieve state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets. Code has been made available at: this https URL.

[1]  Haoyu Wang,et al.  Pose Flow: Efficient Online Pose Tracking , 2018, BMVC.

[2]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Shih-En Wei Convolutional Pose Machines : A Deep Architecture for Estimating Articulated Poses , 2016 .

[5]  Tang Tang,et al.  Multi-Domain Pose Network for Multi-Person Pose Estimation and Tracking , 2018, ECCV Workshops.

[6]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[7]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Andrew Zisserman,et al.  Recurrent Human Pose Estimation , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[9]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[11]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[12]  Juergen Gall,et al.  PoseTrack: Joint Multi-person Pose Estimation and Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Lorenzo Torresani,et al.  Detect-and-Track: Efficient Pose Estimation in Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Daniel P. Huttenlocher,et al.  Beyond trees: common-factor models for 2D human pose recovery , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[15]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[16]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[18]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Yang Wang,et al.  Multiple Tree Models for Occlusion and Spatial Constraints in Human Pose Estimation , 2008, ECCV.

[20]  Bernt Schiele,et al.  Articulated Multi-person Tracking in the Wild , 2016, ArXiv.

[21]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[27]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Junwei Han,et al.  PoseFlow: A Deep Motion Representation for Understanding Human Behaviors in Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[30]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[32]  Andrew Zisserman,et al.  Self-supervised learning of a facial attribute embedding from video , 2018, BMVC.

[33]  Luc Van Gool,et al.  Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Allan Jabri,et al.  Learning Correspondence From the Cycle-Consistency of Time , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[38]  Andrew Zisserman,et al.  Personalizing Human Video Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Bernt Schiele,et al.  Pictorial structures revisited: People detection and articulated pose estimation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Jianbo Shi,et al.  Convolutional Random Walk Networks for Semantic Image Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[42]  Xiaogang Wang,et al.  Multi-source Deep Learning for Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[44]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[45]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[46]  Hod Lipson,et al.  Understanding Neural Networks Through Deep Visualization , 2015, ArXiv.

[47]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[48]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[49]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Peter V. Gehler,et al.  Strong Appearance and Expressive Spatial Models for Human Pose Estimation , 2013, 2013 IEEE International Conference on Computer Vision.

[51]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Michael J. Black,et al.  Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[54]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[55]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[56]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[57]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Luc Van Gool,et al.  Human Pose Estimation Using Body Parts Dependent Joint Regressors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Jianbo Shi,et al.  DeepEdge: A multi-scale bifurcated deep network for top-down contour detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[62]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.