Still Image Action Recognition by Predicting Spatial-Temporal Pixel Evolution

Both spatial and temporal patterns provide crucial information for recognizing human actions. However, lack of temporal information in still images is a major obstacle in single-image action recognition. In this paper, (i) We introduce a novel image representation domain, Ranked Saliency Map and Predicted Optical Flow or Rank_SM-POF for short. This domain captures both actor appearance and the future movement patterns of the actor. This is accomplished through capturing the temporal ordering of each pixel by training a linear ranking machine on the predicted tensor of spatial-temporal representation of images. (ii) We employ a transfer learning approach to propose a new spatial-temporal Convolutional Neural Network, named STCNN for the task of single image action classification, by fine-tuning a CNN which is pre-trained specifically for appearance based classification. (iii) Finally, extensive experiments on five benchmarks clearly demonstrate that appearance and motion are complementary sources of information and using both leads to significant performance improvement in single image action recognition, hence outperforming state-of-the-art methods.

[1]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[2]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[3]  Ivan Laptev,et al.  Learning person-object interactions for action recognition in still images , 2011, NIPS.

[4]  Martial Hebert,et al.  Dense Optical Flow Prediction from a Static Image , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[6]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[7]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Ivan Laptev,et al.  Recognizing human actions in still images: a study of bag-of-features and part-based representations , 2010, BMVC.

[9]  Fei-Fei Li,et al.  Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Stan Sclaroff,et al.  Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web , 2015, Pattern Recognit..

[11]  Michael Felsberg,et al.  Coloring Action Recognition in Still Images , 2013, International Journal of Computer Vision.

[12]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[13]  Cordelia Schmid,et al.  Discriminative spatial saliency for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[15]  Huimin Ma,et al.  Single Image Action Recognition Using Semantic Body Part Actions , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Cordelia Schmid,et al.  Expanded Parts Model for Semantic Description of Humans in Still Images , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[18]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Subhransu Maji,et al.  Describing people: A poselet-based approach to attribute classification , 2011, 2011 International Conference on Computer Vision.

[20]  Michael Felsberg,et al.  Semantic Pyramids for Gender and Action Recognition , 2014, IEEE Transactions on Image Processing.

[21]  N. Otsu A threshold selection method from gray level histograms , 1979 .

[22]  Huimin Ma,et al.  Generalized symmetric pair model for action classification in still images , 2017, Pattern Recognit..

[23]  Pietro Perona,et al.  One-shot learning of object categories , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Jitendra Malik,et al.  Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Cordelia Schmid,et al.  Weakly Supervised Learning of Interactions between Humans and Objects , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Fahad Shahbaz Khan,et al.  Recognizing Actions Through Action-Specific Person Detection , 2015, IEEE Transactions on Image Processing.

[27]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[28]  Cordelia Schmid,et al.  Expanded Parts Model for Human Attribute and Action Recognition in Still Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Jianfei Cai,et al.  Action Recognition in Still Images With Minimum Annotation Efforts , 2016, IEEE Transactions on Image Processing.

[30]  Huimin Ma,et al.  Semantic parts based top-down pyramid for action recognition , 2016, Pattern Recognit. Lett..

[31]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[33]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[34]  Yong Xu,et al.  Image-based action recognition using hint-enhanced deep neural networks , 2017, Neurocomputing.

[35]  Minh Hoai,et al.  Regularized Max Pooling for Image Categorization , 2014, BMVC.

[36]  Dahua Lin,et al.  Recognize complex events from static images by fusing deep channels , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Guodong Guo,et al.  A survey on still image based human action recognition , 2014, Pattern Recognit..

[38]  Liang Lin,et al.  An expressive deep model for human action parsing from a single image , 2014, 2014 IEEE International Conference on Multimedia and Expo (ICME).

[39]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[40]  Peyman Milanfar,et al.  Static and space-time visual saliency detection by self-resemblance. , 2009, Journal of vision.

[41]  Xue Li,et al.  Action recognition in still images using a combination of human pose and context information , 2012, 2012 19th IEEE International Conference on Image Processing.

[42]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[43]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[45]  Jitendra Malik,et al.  R-CNNs for Pose Estimation and Action Detection , 2014, ArXiv.