Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing

To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner. It is a compact, efficient and powerful framework that exploits structural information over different human granularities and eases the difficulty of person partitioning. Specifically, a dense-to-sparse projection field, which allows explicitly associating dense human semantics with sparse keypoints, is learnt and progressively improved over the network feature pyramid for robustness. Then, the difficult pixel grouping problem is cast as an easier, multi-person joint assembling task. By formulating joint association as maximum-weight bipartite matching, a differentiable solution is developed to exploit projected gradient descent and Dykstra’s cyclic projection algorithm. This makes our method end-to-end trainable and allows back-propagating the grouping error to directly supervise multi-granularity human representation learning. This is distinguished from current bottom-up human parsers or pose estimators which require sophisticated post-processing or heuristic greedy algorithms. Experiments on three instance-aware human parsing datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.

[1]  Song-Chun Zhu,et al.  Hierarchical Human Semantic Parsing With Comprehensive Part-Relation Modeling , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Song-Chun Zhu,et al.  Cascaded Parsing of Human-Object Interaction Recognition , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Lu Yang,et al.  Renovating Parsing R-CNN for Accurate Multiple Human Parsing , 2020, ECCV.

[4]  Wenhai Wang,et al.  Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation , 2020, ECCV.

[5]  Stephen Lin,et al.  Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation , 2020, ECCV.

[6]  Ming Tang,et al.  Part-Aware Context Network for Human Parsing , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Liang Zheng,et al.  Correlating Edge, Pose With Parsing , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Ling Shao,et al.  Hierarchical Human Parsing With Typed Part-Relation Reasoning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  R. Urtasun,et al.  Dense RepPoints: Representing Visual Objects with Dense Point Sets , 2019, ECCV.

[10]  Feiyue Huang,et al.  Learning Semantic Neural Tree for Human Parsing , 2019, ECCV.

[11]  Jia Li,et al.  Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation , 2019, AAAI.

[12]  Ling Shao,et al.  Human-Aware Motion Deblurring , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Ling Shao,et al.  Learning Compositional Neural Information Fusion for Human Parsing , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Ping Luo,et al.  PolarMask: Single Shot Instance Segmentation With Polar Representation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Sanja Fidler,et al.  DMM-Net: Differentiable Mask-Matching Network for Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Song-Chun Zhu,et al.  Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Shuicheng Yan,et al.  Single-Stage Multi-Person Pose Machines , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Chen Change Loy,et al.  CARAFE: Content-Aware ReAssembly of FEatures , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Stephen Lin,et al.  RepPoints: Point Set Representation for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Stefan Roth,et al.  Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Xinlei Chen,et al.  TensorMask: A Foundation for Dense Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Alexandre Alahi,et al.  PifPaf: Composite Fields for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Ming Jiang,et al.  Parsing R-CNN for Instance-Level Human Analysis , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Yunchao Wei,et al.  Devil in the Details: Towards Accurate Single and Multiple Human Parsing , 2018, AAAI.

[25]  Shuicheng Yan,et al.  Mutual Learning to Adapt for Joint Human Parsing and Pose Estimation , 2018, ECCV.

[26]  Song-Chun Zhu,et al.  Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.

[27]  Ming Yang,et al.  Instance-level Human Parsing via Part Grouping Network , 2018, ECCV.

[28]  Song-Chun Zhu,et al.  Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Cewu Lu,et al.  Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Yu Cheng,et al.  Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing , 2018, ACM Multimedia.

[31]  Paul L. Rosin,et al.  Pose2Seg: Detection Free Human Instance Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jonathan Tompson,et al.  PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model , 2018, ECCV.

[33]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[34]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Yao Sun,et al.  Cross-domain Human Parsing via Adversarial Feature and Label Adaptation , 2018, AAAI.

[36]  Gang Yu,et al.  Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Dacheng Tao,et al.  A Coarse-Fine Network for Keypoint Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Philip H. S. Torr,et al.  Holistic, Instance-Level Human Parsing , 2017, BMVC.

[39]  Deqing Sun,et al.  PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Peng Wang,et al.  Joint Multi-person Pose Estimation and Semantic Part Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Yunchao Wei,et al.  Multiple-Human Parsing in the Wild , 2017 .

[43]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[44]  Ke Gong,et al.  Look into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Serge J. Belongie,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Yaser Sheikh,et al.  Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Zhiao Huang,et al.  Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[50]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[51]  Jun Zhu,et al.  Pose-Guided Human Parsing by an AND/OR Graph Using Pose-Context Features , 2016, AAAI.

[52]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Shuicheng Yan,et al.  Human Parsing with Contextualized Convolutional Neural Network , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Alan L. Yuille,et al.  Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net , 2015, ECCV.

[56]  Max Jaderberg,et al.  Spatial Transformer Networks , 2015, NIPS.

[57]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[58]  Jian Dong,et al.  Towards Unified Human Parsing and Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[60]  Luis E. Ortiz,et al.  Parsing clothing in fashion photographs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Heinz H. Bauschke,et al.  Convex Analysis and Monotone Operator Theory in Hilbert Spaces , 2011, CMS Books in Mathematics.

[62]  R. Mathar,et al.  A cyclic projection algorithm via duality , 1989 .

[63]  Paul H. Calamai,et al.  Projected gradient methods for linearly constrained problems , 1987, Math. Program..

[64]  A. Schrijver Theory of linear and integer programming , 1986, Wiley-Interscience series in discrete mathematics and optimization.

[65]  R. Dykstra An Algorithm for Restricted Least Squares Regression , 1983 .

[66]  Mario Osvin Pavčević,et al.  Introduction to graph theory , 1973, The Mathematical Gazette.

[67]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[68]  R. Dykstra,et al.  A Method for Finding Projections onto the Intersection of Convex Sets in Hilbert Spaces , 1986 .