Progressive Cognitive Human Parsing

Human parsing is an important task for human-centric understanding. Generally, two mainstreams are used to deal with this challenging and fundamental problem. The first one is employing extra human pose information to generate hierarchical parse graph to deal with human parsing task. Another one is training an end-to-end network with the semantic information in image level. In this paper, we develop an end-to-end progressive cognitive network to segment human parts. In order to establish a hierarchical relationship, a novel component-aware region convolution structure is proposed. With this structure, latter layers inherit prior component information from former layers and pay its attention to a finer component. In this way, we deal with human parsing as a progressive recognition task, that is, we first locate the whole human and then segment the hierarchical components gradually. The experiments indicate that our method has a better location capacity for the small objects and a better classification capacity for the large objects. Moreover, our framework can be embedded into any fully convolutional network to enhance the performance significantly.

[1]  Song-Chun Zhu,et al.  Attribute And-Or Grammar for Joint Parsing of Human Pose, Parts and Attributes , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Xiaodan Liang,et al.  Human Parsing with Contextualized Convolutional Neural Network. , 2017, IEEE transactions on pattern analysis and machine intelligence.

[4]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[5]  Xiaoxiao Li,et al.  Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Alan L. Yuille,et al.  Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net , 2015, ECCV.

[7]  Shimon Ullman,et al.  Image interpretation by a single bottom-up top-down cycle , 2008, Proceedings of the National Academy of Sciences.

[8]  Xiaogang Wang,et al.  Unsupervised Salience Learning for Person Re-identification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[10]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[11]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Changsheng Xu,et al.  Hi, magic closet, tell me what to wear! , 2012, ACM Multimedia.

[13]  Ke Gong,et al.  Look into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Gang Sun,et al.  Squeeze-and-Excitation Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Ian D. Reid,et al.  RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Stéphane Lecoeuche,et al.  Application of an incremental SVM algorithm for on-line human recognition from video surveillance using texture and color features , 2014, Neurocomputing.

[18]  Martial Hebert,et al.  Stacked Hierarchical Labeling , 2010, ECCV.

[19]  Peng Wang,et al.  Joint Multi-person Pose Estimation and Semantic Part Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Luis E. Ortiz,et al.  Parsing clothing in fashion photographs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Yunchao Wei,et al.  Towards Real World Human Parsing: Multiple-Human Parsing in the Wild , 2017, ArXiv.

[23]  Jun Zhu,et al.  Pose-Guided Human Parsing by an AND/OR Graph Using Pose-Context Features , 2016, AAAI.

[24]  Tao Mei,et al.  Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Sanja Fidler,et al.  Detect What You Can: Detecting and Representing Objects Using Holistic Models and Body Parts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Alan L. Yuille,et al.  An Approach to Pose-Based Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).