Direct Dense Pose Estimation

Dense human pose estimation is the problem of learning dense correspondences between RGB images and the surfaces of human bodies. It has applications in human body reconstruction, human pose transfer, and human action recognition. Prior dense pose estimation methods are all based on the Mask R-CNN framework and operate in a top-down manner: they first attempt to identify a bounding box for each person and then match dense correspondences within each bounding box. Consequently, these methods lack robustness because they depend critically on the Mask R-CNN detection, and their runtime increases drastically with the number of persons in the image. We therefore propose a novel alternative method for solving the dense pose estimation problem, called Direct Dense Pose (DDP). DDP first predicts the instance masks and a global IUV representation separately and then combines them. We also propose a simple yet effective 2D temporal-smoothing scheme to alleviate temporal jitter when dealing with video data. Experiments demonstrate that DDP overcomes the limitations of previous top-down baseline methods and achieves competitive accuracy. In addition, DDP is computationally more efficient than previous dense pose estimation methods, and on video sequences it reduces the jitter that plagues those methods.
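The combination step described above can be illustrated with a minimal sketch. It assumes that each instance mask is a binary H x W grid and that the global IUV representation is three H x W channels (part index I and surface coordinates U, V) shared by all persons in the image. The function name and the data layout are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List

Mask = List[List[bool]]          # H x W binary mask for one person (assumed layout)
Channel = List[List[float]]      # H x W map for one IUV channel (assumed layout)


def combine_masks_with_global_iuv(
    instance_masks: List[Mask], global_iuv: List[Channel]
) -> List[List[Channel]]:
    """Cut a per-person IUV map out of one image-wide IUV prediction.

    Hypothetical helper: returns one [I, U, V] triple of H x W maps per
    instance, with every pixel outside that instance's mask set to 0.0.
    """
    per_person = []
    for mask in instance_masks:
        person = [
            [
                # Keep the global value only where the mask is on.
                [value if keep else 0.0 for value, keep in zip(row, mask_row)]
                for row, mask_row in zip(channel, mask)
            ]
            for channel in global_iuv
        ]
        per_person.append(person)
    return per_person


# Two persons on a 2 x 2 image, with one global I, U, V prediction.
masks = [
    [[True, False], [False, False]],
    [[False, True], [False, True]],
]
iuv = [
    [[1.0, 2.0], [0.0, 2.0]],   # I: body part index per pixel
    [[0.1, 0.2], [0.3, 0.4]],   # U
    [[0.5, 0.6], [0.7, 0.8]],   # V
]
result = combine_masks_with_global_iuv(masks, iuv)
```

In this sketch the IUV channels are predicted once for the whole image, so the per-person work reduces to a masking operation. That is consistent with the claim that the approach avoids the per-bounding-box matching of top-down methods, although the actual DDP architecture may combine the two outputs differently.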
