论文信息 - Multi-scale Aggregation R-CNN for 2D Multi-person Pose Estimation

Multi-scale Aggregation R-CNN for 2D Multi-person Pose Estimation

Multi-person pose estimation from a 2D image is challenging because it requires not only keypoint localization but also human detection. In state-of-the-art top-down methods, multi-scale information is a crucial factor for the accurate pose estimation because it contains both of local information around the keypoints and global information of the entire person. Although multi-scale information allows these methods to achieve the state-of-the-art performance, the top-down methods still require a huge amount of computation because they need to use an additional human detector to feed the cropped human image to their pose estimation model. To effectively utilize multi-scale information with the smaller computation, we propose a multi-scale aggregation R-CNN (MSA R-CNN). It consists of multi-scale RoIAlign block (MS-RoIAlign) and multi-scale keypoint head network (MS-KpsNet) which are designed to effectively utilize multi-scale information. Also, in contrast to previous top-down methods, the MSA R-CNN performs human detection and keypoint localization in a single model, which results in reduced computation. The proposed model achieved the best performance among single model-based methods and its results are comparable to those of separated model-based methods with a smaller amount of computation on the publicly available 2D multi-person keypoint localization dataset.

Kyoung Mu Lee | Gyeongsik Moon | Ju Yong Chang | Gyeongsik Moon

[1] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Yichen Wei,et al. Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[3] Yuning Jiang,et al. MegDet: A Large Mini-Batch Object Detector , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[5] Yaser Sheikh,et al. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Jonathan Tompson,et al. Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Varun Ramakrishna,et al. Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[11] Jia Deng,et al. Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[12] Jitendra Malik,et al. Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Xiaogang Wang,et al. Multi-context Attention for Human Pose Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Jonathan Tompson,et al. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[16] Christian Szegedy,et al. DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17] Zhiao Huang,et al. Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[18] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[19] Georgios Tzimiropoulos,et al. Human Pose Estimation via Convolutional Part Heatmap Regression , 2016, ECCV.

[20] Jian Yang,et al. Adversarial Learning of Structure-Aware Fully Convolutional Networks for Landmark Localization , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21] Pietro Perona,et al. Benchmarking and Error Diagnosis in Multi-instance Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22] Emre Akbas,et al. MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network , 2018, ECCV.

[23] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[24] Thomas Brox,et al. U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[25] Peter V. Gehler,et al. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Kyoung Mu Lee,et al. PoseFix: Model-Agnostic General Human Pose Refinement Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Gang Yu,et al. Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28] Bernt Schiele,et al. DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[29] Cewu Lu,et al. RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[30] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[31] Sergey Ioffe,et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[32] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[33] Dacheng Tao,et al. A Coarse-Fine Network for Keypoint Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).