Object Part Parsing with Hierarchical Dual Transformer

Object part parsing, the task of segmenting objects into semantic parts, has drawn great attention recently. Current methods ignore the hierarchical structure of the object, which can serve as strong prior knowledge. To address this, we propose the Hierarchical Dual Transformer (HDTR), which exploits the structural priors of object parts. HDTR first generates pyramid multi-granularity pixel representations under the supervision of object part parsing maps at different semantic levels, and then assigns each region an initial part embedding. In addition, HDTR generates an edge pixel representation to improve the network's ability to capture fine detail. We then design a Hierarchical Part Transformer that upgrades the part embeddings to their hierarchical counterparts with the assistance of the multi-granularity pixel representations. Next, we propose a Hierarchical Pixel Transformer that infers hierarchical information from the part embeddings to enrich the pixel representations. Both transformer decoders rely on the structural relations between object parts, i.e., dependency, composition, and decomposition relations. Experiments on five large-scale datasets, i.e., LaPa, CelebAMask-HQ, CIHP, LIP, and Pascal Animal, demonstrate that our method achieves new state-of-the-art performance for object part parsing.
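
The two-stage exchange between part embeddings and pixel representations can be illustrated with a minimal sketch. This is not the HDTR implementation: it assumes plain single-head dot-product cross-attention with a residual connection, no learned projections, and a single granularity level, and it omits the dependency, composition, and decomposition relations entirely. The names `cross_attend` and `dual_update` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head dot-product cross-attention with a residual connection.

    Assumption: no learned query/key/value projections, so the same
    vectors act as keys and values.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def dual_update(part_embeddings, pixel_features):
    """Hypothetical two-step update: parts read pixels, then pixels read parts."""
    # Step 1 (part side): each part embedding queries the pixel features.
    parts = cross_attend(part_embeddings, pixel_features)
    # Step 2 (pixel side): each pixel feature queries the updated parts.
    pixels = cross_attend(pixel_features, parts)
    return parts, pixels


# Toy input: 2 part embeddings and 3 pixel features, all 2-dimensional.
parts = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
new_parts, new_pixels = dual_update(parts, pixels)
```

In this toy setting the number of parts and pixels is preserved, and only their feature vectors change. A fuller model would add learned projections, multiple heads, and relation-specific attention over the part hierarchy.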
