Scale coding bag of deep features for human attribute and action recognition

Most approaches to human attribute and action recognition in still images are based on image representations in which multi-scale local features are pooled across scales into a single, scale-invariant encoding. This holds both for bag-of-words and for the recently popular representations based on convolutional neural networks: local features are computed at multiple scales but are then pooled into a single representation. We argue that entirely scale-invariant image representations are sub-optimal and investigate approaches to scale coding within a bag of deep features framework. We propose two strategies that encode multi-scale information explicitly in the final image representation, during the image encoding stage. We validate both scale coding techniques on five datasets: Willow, PASCAL VOC 2010, PASCAL VOC 2012, Stanford-40 and Human Attributes (HAT-27). On all datasets, the proposed scale coding approaches outperform both the scale-invariant method and the standard deep features of the same network. Further, combining our scale coding approaches with standard deep features leads to consistent improvement over the state of the art.
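The contrast between a scale-invariant encoding and a scale-coded one can be illustrated with a minimal sketch. It is not the method of the paper: the abstract does not specify the encoder or the two proposed strategies, so mean pooling stands in for the encoding step, and the names `scale_invariant_encoding`, `scale_coded_encoding`, and the `boundaries` thresholds are hypothetical choices made only for illustration.

```python
from typing import List, Sequence, Tuple

# A local feature is assumed to be (extraction scale, descriptor vector).
Feature = Tuple[float, Sequence[float]]


def mean_pool(descriptors: List[Sequence[float]], dim: int) -> List[float]:
    """Average a list of descriptors; an empty group pools to zeros."""
    if not descriptors:
        return [0.0] * dim
    n = len(descriptors)
    return [sum(d[i] for d in descriptors) / n for i in range(dim)]


def scale_invariant_encoding(features: List[Feature], dim: int) -> List[float]:
    """Pool features from all scales together; scale information is lost."""
    return mean_pool([desc for _, desc in features], dim)


def scale_coded_encoding(
    features: List[Feature], dim: int, boundaries: Sequence[float]
) -> List[float]:
    """Pool features per scale group and concatenate the group encodings.

    `boundaries` is an ascending list of hypothetical scale thresholds that
    splits the features into len(boundaries) + 1 groups.
    """
    groups: List[List[Sequence[float]]] = [[] for _ in range(len(boundaries) + 1)]
    for scale, desc in features:
        group_index = sum(scale >= b for b in boundaries)
        groups[group_index].append(desc)
    encoding: List[float] = []
    for group in groups:
        encoding.extend(mean_pool(group, dim))
    return encoding


# Toy input: three 2-D descriptors extracted at three different scales.
features = [(0.5, [1.0, 0.0]), (1.0, [0.0, 1.0]), (2.0, [1.0, 1.0])]
invariant = scale_invariant_encoding(features, dim=2)
coded = scale_coded_encoding(features, dim=2, boundaries=(0.75, 1.5))
```

In this toy setting the scale-invariant vector keeps the dimensionality of a single descriptor, while the scale-coded vector is three times longer and keeps the contribution of each scale group in its own block, so a classifier can weight the groups differently.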
