Person Attribute Recognition with a Jointly-Trained Holistic CNN Model

This paper addresses the problem of human visual attribute recognition, i.e., the prediction of a fixed set of semantic attributes given an image of a person. Previous work often considered the different attributes independently from each other, without taking advantage of possible dependencies between them. In contrast, we propose a method to jointly train a CNN model for all attributes that can take advantage of those dependencies, considering as input only the image without additional external pose, part or context information. We report detailed experiments examining the contribution of individual aspects, which yields beneficial insights for other researchers. Our holistic CNN achieves superior performance on two publicly available attribute datasets improving on methods that additionally rely on pose-alignment or context. To support further evaluations, we present a novel dataset, based on realistic outdoor video sequences, that contains more than 27,000 pedestrians annotated with 10 attributes. Finally, we explore design options to embrace the N/A labels inherently present in this task.

[1]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[2]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[3]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[4]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[7]  Subhransu Maji,et al.  Describing people: A poselet-based approach to attribute classification , 2011, 2011 International Conference on Computer Vision.

[8]  Larry S. Davis,et al.  Image ranking and retrieval based on multi-attribute queries , 2011, CVPR 2011.

[9]  Gaurav Sharma,et al.  Learning discriminative spatial representation for image classification , 2011, BMVC.

[10]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[11]  Forrest N. Iandola,et al.  Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Song-Chun Zhu,et al.  Human Attribute Recognition by Rich Appearance Dictionary , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Cordelia Schmid,et al.  Expanded Parts Model for Human Attribute and Action Recognition in Still Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[15]  Pietro Perona,et al.  Bird Species Categorization Using Pose Normalized Deep Convolutional Nets , 2014, ArXiv.

[16]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  R. Fergus,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[18]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[19]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[20]  Trevor Darrell,et al.  PANDA: Pose Aligned Networks for Deep Attribute Modeling , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[22]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Pietro Perona,et al.  Fine-grained classification of pedestrians in video: Benchmark and state of the art , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).