Semantic Pyramids for Gender and Action Recognition

Person description is a challenging problem in computer vision. We investigated two major aspects of person description: 1) gender and 2) action recognition in still images. Most state-of-the-art approaches for gender and action recognition rely on the description of a single body part, such as face or full-body. However, relying on a single body part is suboptimal due to significant variations in scale, viewpoint, and pose in real-world images. This paper proposes a semantic pyramid approach for pose normalization. Our approach is fully automatic and based on combining information from full-body, upper-body, and face regions for gender and action recognition in still images. The proposed approach does not require any annotations for upper-body and face of a person. Instead, we rely on pretrained state-of-the-art upper-body and face detectors to automatically extract semantic information of a person. Given multiple bounding boxes from each body part detector, we then propose a simple method to select the best candidate bounding box, which is used for feature extraction. Finally, the extracted features from the full-body, upper-body, and face regions are combined into a single representation for classification. To validate the proposed approach for gender recognition, experiments are performed on three large data sets namely: 1) human attribute; 2) head-shoulder; and 3) proxemics. For action recognition, we perform experiments on four data sets most used for benchmarking action recognition in still images: 1) Sports; 2) Willow; 3) PASCAL VOC 2010; and 4) Stanford-40. Our experiments clearly demonstrate that the proposed approach, despite its simplicity, outperforms state-of-the-art methods for gender and action recognition.

[1]  Fei-Fei Li,et al.  Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Edwin R. Hancock,et al.  Gender discriminating models from facial surface normals , 2011, Pattern Recognit..

[3]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Arnold W. M. Smeulders,et al.  Fine-Grained Categorization by Alignments , 2013, 2013 IEEE International Conference on Computer Vision.

[5]  Andrew Zisserman,et al.  2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images , 2012, International Journal of Computer Vision.

[6]  Muhammad Ghulam,et al.  Gender Recognition Using Nonsubsampled Contourlet Transform and WLD Descriptor , 2013, SCIA.

[7]  Cordelia Schmid,et al.  Learning Color Names for Real-World Applications , 2009, IEEE Transactions on Image Processing.

[8]  Matti Pietikäinen,et al.  IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2009, TPAMI-2008-09-0620 1 WLD: A Robust Local Image Descriptor , 2022 .

[9]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[10]  Bernt Schiele,et al.  New features and insights for pedestrian detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Zhenhua Guo,et al.  Rotation invariant texture classification using LBP variance (LBPV) with global matching , 2010, Pattern Recognit..

[12]  Ivan Laptev,et al.  Learning person-object interactions for action recognition in still images , 2011, NIPS.

[13]  Cheng Li,et al.  Pixel-Level Hand Detection in Ego-centric Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  Deva Ramanan,et al.  Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[17]  Fahad Shahbaz Khan,et al.  Portmanteau Vocabularies for Multi-Cue Image Representation , 2011, NIPS.

[18]  Luc Van Gool,et al.  Real-time facial feature detection using conditional regression forests , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[20]  Andrew Zisserman,et al.  Representing shape with a spatial pyramid kernel , 2007, CIVR '07.

[21]  Larry S. Davis,et al.  Human detection using partial least squares analysis , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[22]  Weishan Dong,et al.  Head-shoulder based gender recognition , 2013, 2013 IEEE International Conference on Image Processing.

[23]  Andrew Zisserman,et al.  Progressive search space reduction for human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Subhransu Maji,et al.  Describing people: A poselet-based approach to attribute classification , 2011, 2011 International Conference on Computer Vision.

[25]  Michael Felsberg,et al.  Coloring Action Recognition in Still Images , 2013, International Journal of Computer Vision.

[26]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[27]  Paul C. Miller,et al.  Full body image feature representations for gender profiling , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[28]  Shihong Lao,et al.  Multiple Human Tracking Based on Multi-view Upper-Body Detection and Discriminative Learning , 2010, 2010 20th International Conference on Pattern Recognition.

[29]  F. Xavier Roca,et al.  On Importance of Interactions and Context in Human Action Recognition , 2011, IbPRIA.

[30]  Cordelia Schmid,et al.  Expanded Parts Model for Human Attribute and Action Recognition in Still Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Larry S. Davis,et al.  Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance , 2011, 2011 International Conference on Computer Vision.

[32]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Fahad Shahbaz Khan,et al.  Color attributes for object detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[35]  Ivan Laptev,et al.  Recognizing human actions in still images: a study of bag-of-features and part-based representations , 2010, BMVC.

[36]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[37]  Forrest N. Iandola,et al.  Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[38]  Cordelia Schmid,et al.  Weakly Supervised Learning of Interactions between Humans and Objects , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Kun Duan,et al.  Discovering localized attributes for fine-grained recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Zhenhua Guo,et al.  A Completed Modeling of Local Binary Pattern Operator for Texture Classification , 2010, IEEE Transactions on Image Processing.

[43]  Edwin R. Hancock,et al.  Facial gender classification using shape-from-shading , 2010, Image Vis. Comput..

[44]  Fahad Shahbaz Khan,et al.  Discriminative compact pyramids for object and scene recognition , 2012, Pattern Recognition.

[45]  Roope Raisamo,et al.  Evaluation of Gender Classification Methods with Automatically Detected and Aligned Faces , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[47]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[48]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[49]  Martin A. Fischler,et al.  The Representation and Matching of Pictorial Structures , 1973, IEEE Transactions on Computers.

[50]  Luís A. Alexandre Gender recognition: A multiscale decision fusion approach , 2010, Pattern Recognit. Lett..

[51]  Michael Felsberg,et al.  Evaluating the Impact of Color on Texture Recognition , 2013, CAIP.

[52]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[53]  Cordelia Schmid,et al.  Discriminative spatial saliency for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Michael Felsberg,et al.  Scale Coding Bag-of-Words for Action Recognition , 2014, 2014 22nd International Conference on Pattern Recognition.

[55]  Andrew Zisserman,et al.  Hand detection using multiple proposals , 2011, BMVC.

[56]  Larry S. Davis,et al.  Objects in Action: An Approach for Combining Action Understanding and Object Perception , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[58]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[59]  Yi Yang,et al.  Recognizing proxemics in personal photos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Caifeng Shan,et al.  Learning local binary patterns for gender classification on real-world face images , 2012, Pattern Recognit. Lett..