Rethinking on Multi-Stage Networks for Human Pose Estimation

Existing pose estimation approaches fall into two categories: single-stage and multi-stage methods. While multi-stage methods are seemingly more suited for the task, their performance in current practice is not as good as single-stage methods. This work studies this issue. We argue that the current multi-stage methods' unsatisfactory performance comes from the insufficiency in various design choices. We propose several improvements, including the single-stage module design, cross stage feature aggregation, and coarse-to-fine supervision. The resulting method establishes the new state-of-the-art on both MS COCO and MPII Human Pose dataset, justifying the effectiveness of a multi-stage architecture. The source code is publicly available for further research.

[1]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Jitendra Malik,et al.  Articulated Pose Estimation Using Discriminative Armlet Classifiers , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[4]  Peter V. Gehler,et al.  Poselet Conditioned Pictorial Structures , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Mark Everingham,et al.  Learning effective human pose estimation from inaccurate annotation , 2011, CVPR 2011.

[6]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[7]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[8]  Honggang Qi,et al.  Multi-Scale Structure-Aware Network for Human Pose Estimation , 2018, ECCV.

[9]  Ruigang Yang,et al.  Human Pose Estimation with Spatial Contextual Information , 2019, ArXiv.

[10]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Dimitris N. Metaxas,et al.  Quantized Densely Connected U-Nets for Efficient Landmark Localization , 2018, ECCV.

[14]  Andrew Zisserman,et al.  Recurrent Human Pose Estimation , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[15]  Xiu-Shen Wei,et al.  Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Yuning Jiang,et al.  MegDet: A Large Mini-Batch Object Detector , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Xiaogang Wang,et al.  Multi-context Attention for Human Pose Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[20]  Georgios Tzimiropoulos,et al.  Human Pose Estimation via Convolutional Part Heatmap Regression , 2016, ECCV.

[21]  Xiangyu Zhang,et al.  DetNet: A Backbone network for Object Detection , 2018, ArXiv.

[22]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Zhi Zhang,et al.  Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation , 2017, IEEE Transactions on Multimedia.

[24]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Gang Sun,et al.  Squeeze-and-Excitation Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Luc Van Gool,et al.  Human Pose Estimation Using Body Parts Dependent Joint Regressors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Navdeep Jaitly,et al.  Chained Predictions Using Convolutional Neural Networks , 2016, ECCV.

[29]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[33]  Gang Yu,et al.  Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Zhiao Huang,et al.  Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[36]  Hwann-Tzong Chen,et al.  Self Adversarial Training for Human Pose Estimation , 2017, 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[37]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Ying Wu,et al.  Deeply Learned Compositional Models for Human Pose Estimation , 2018, ECCV.

[41]  David Picard,et al.  Human Pose Regression by Combining Indirect Part Detection and Contextual Information , 2017, Comput. Graph..

[42]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[43]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[45]  Ben Taskar,et al.  MODEC: Multimodal Decomposable Models for Human Pose Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Bernt Schiele,et al.  Pictorial structures revisited: People detection and articulated pose estimation , 2009, CVPR.

[47]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[48]  Ben Taskar,et al.  Adaptive pose priors for pictorial structures , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.