Human Attribute Recognition by Deep Hierarchical Contexts

We present an approach for recognizing human attributes in unconstrained settings. We train a Convolutional Neural Network (CNN) to select the most attribute-descriptive human parts from all poselet detections and combine them with the whole body into a pose-normalized deep representation. We further improve recognition by using deep hierarchical contexts ranging from the human-centric level to the scene level. Human-centric context captures relations between people; we compute it from the nearest-neighbor parts of other people on a pyramid of CNN feature maps. The matched parts are then average-pooled and act as a similarity regularization. To exploit scene context, we re-score the human-centric predictions with the global scene classification score learned jointly in our CNN, yielding the final scene-aware predictions. To facilitate our study, we introduce a large-scale WIDER Attribute dataset (dataset URL: http://mmlab.ie.cuhk.edu.hk/projects/WIDERAttribute) with human attribute and image event annotations. Our method surpasses competitive baselines on this dataset and on other popular ones.
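
The two context levels described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names are hypothetical, cosine similarity is assumed as the matching metric, the feature pyramid is assumed to be already flattened into one pool of part feature vectors, and the scene re-scoring is assumed to be a simple multiplicative re-weighting.

```python
import numpy as np


def human_centric_context(target_parts, other_parts):
    """Average-pool the nearest-neighbor parts found on other people.

    target_parts: (P, D) features of the target person's selected parts.
    other_parts:  (M, D) part features pooled from other people
                  (assumed to be gathered over all pyramid levels).
    Returns a (D,) context vector; zeros if nobody else is present.
    """
    if other_parts.shape[0] == 0:
        return np.zeros(target_parts.shape[1])
    # Assumption: cosine similarity is used to match parts.
    eps = 1e-12
    t = target_parts / (np.linalg.norm(target_parts, axis=1, keepdims=True) + eps)
    o = other_parts / (np.linalg.norm(other_parts, axis=1, keepdims=True) + eps)
    similarity = t @ o.T                     # (P, M)
    nearest = similarity.argmax(axis=1)      # best match per target part
    matched = other_parts[nearest]           # (P, D)
    return matched.mean(axis=0)              # average pooling


def scene_aware_scores(attr_scores, scene_probs, scene_attr_weights):
    """Re-score human-centric attribute scores with scene classification.

    attr_scores:        (A,) human-centric attribute scores.
    scene_probs:        (S,) scene classification probabilities.
    scene_attr_weights: (S, A) hypothetical per-scene attribute weights.
    Assumption: the re-scoring is an element-wise re-weighting.
    """
    return attr_scores * (scene_probs @ scene_attr_weights)


# Toy usage with hand-made features.
target = np.array([[1.0, 0.0], [0.0, 1.0]])
others = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]])
context = human_centric_context(target, others)

rescored = scene_aware_scores(
    np.array([0.5, 0.8]),
    np.array([1.0, 0.0]),
    np.array([[1.0, 0.5], [0.2, 1.0]]),
)
```

In this toy case each target part matches the other-person part pointing in the same direction, so the pooled context is the mean of those two matched features.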
