BraidNet: Braiding Semantics and Details for Accurate Human Parsing

This paper focuses on fine-grained human parsing in images. The task is very challenging because of diverse person appearance, semantic ambiguity between different body parts and clothing, and extremely small parsing targets. Although existing approaches achieve significant improvements through pyramid feature learning, multi-level supervision, and joint learning with pose estimation, human parsing is still far from solved. Unlike existing approaches, we propose a Braiding Network, named BraidNet, to learn complementary semantics and details for fine-grained human parsing. BraidNet has a two-stream, braid-like architecture. The first stream is a semantic abstracting net with a deep yet narrow structure, which learns semantic knowledge through a hierarchy of fully convolutional layers to overcome the challenge of diverse person appearance. The second stream, a detail-preserving net, captures the low-level details of small targets: it uses a shallow yet wide network without down-sampling, which retains sufficient local structure for small objects. Moreover, we design a group of braiding modules across the two sub-nets, through which complementary information is exchanged during end-to-end training. Finally, at the end of BraidNet, a Pairwise Hard Region Embedding strategy is proposed to eliminate the semantic ambiguity between different body parts and clothing. Extensive experiments show that the proposed BraidNet outperforms state-of-the-art methods for fine-grained human parsing.
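To make the idea of exchanging information between a full-resolution stream and a down-sampled stream more concrete, the following is a minimal sketch rather than the paper's braiding module. It makes several assumptions not stated in the abstract: feature maps are single-channel 2D lists, the semantic stream runs at half resolution, resizing uses 2x2 average pooling and nearest-neighbour upsampling, and fusion is a plain element-wise sum. The names `downsample2x`, `upsample2x`, and `braid` are hypothetical.

```python
# Hypothetical sketch of a two-stream exchange step (not the authors' module).

def downsample2x(fmap):
    """2x2 average pooling on a 2D list (height and width must be even)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [(fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling on a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def braid(detail, semantic):
    """Each stream receives the other stream, resized to its own resolution.

    Assumption: fusion is an element-wise sum; a learned fusion would
    replace the additions below.
    """
    sem_up = upsample2x(semantic)     # semantic -> detail resolution
    det_down = downsample2x(detail)   # detail -> semantic resolution
    new_detail = [[d + s for d, s in zip(dr, sr)]
                  for dr, sr in zip(detail, sem_up)]
    new_semantic = [[s + d for s, d in zip(sr, dr)]
                    for sr, dr in zip(semantic, det_down)]
    return new_detail, new_semantic


# Toy inputs: a 4x4 "detail" map and a 2x2 "semantic" map.
detail = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0],
          [1.0, 1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0, 1.0]]
semantic = [[10.0, 20.0],
            [30.0, 40.0]]
new_detail, new_semantic = braid(detail, semantic)
```

In this toy setting the detail stream keeps its 4x4 resolution throughout, while the semantic stream stays at 2x2; only the exchanged copies are resized.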
