Hierarchical Spatial Sum–Product Networks for Action Recognition in Still Images

Recognizing actions from still images has been popularly studied recently. In this paper, we model an action class as a flexible number of spatial configurations of body parts by proposing a new spatial sum–product network (SPN). First, we discover a set of parts in image collections via unsupervised learning. Then, our new spatial SPN is applied to model the spatial relationship and also the high-order correlations of parts. To learn robust networks, we further develop a hierarchical spatial SPN method, which models pairwise spatial relationship between parts inside subimages and models the correlation of subimages via extra layers of SPN. Our method is shown to be effective on two benchmark data sets.

[1]  Subhransu Maji,et al.  Detecting People Using Mutually Consistent Poselet Activations , 2010, ECCV.

[2]  Lin Sun,et al.  DL-SFA: Deeply-Learned Slow Feature Analysis for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Gang Wang,et al.  Scene Parsing With Integration of Parametric and Non-Parametric Models , 2016, IEEE Transactions on Image Processing.

[4]  Robert R. Sokal,et al.  A statistical method for evaluating systematic relationships , 1958 .

[5]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[6]  Charless C. Fowlkes,et al.  Discriminative models for static human-object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[7]  Ivan Laptev,et al.  Recognizing human actions in still images: a study of bag-of-features and part-based representations , 2010, BMVC.

[8]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[9]  Gang Wang,et al.  Real-time part-based visual tracking via adaptive correlation filters , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Cordelia Schmid,et al.  Discriminative spatial saliency for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Pedro M. Domingos,et al.  Sum-product networks: A new deep architecture , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[12]  Shuicheng Yan,et al.  STAP: Spatial-Temporal Attention-Aware Pooling for Action Recognition , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[13]  Iqbal Gondal,et al.  On Temporal Order Invariance for View-Invariant Action Recognition , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[14]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[15]  Pietro Perona,et al.  Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition , 2007, International Journal of Computer Vision.

[16]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Gang Wang,et al.  Multi-modal feature fusion for action recognition in RGB-D sequences , 2014, 2014 6th International Symposium on Communications, Control and Signal Processing (ISCCSP).

[19]  Mohamed R. Amer,et al.  Sum-product networks for modeling activities with stochastic structure , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Yang Wang,et al.  Recognizing human actions from still images with latent poses , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[21]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[22]  Guodong Guo,et al.  A survey on still image based human action recognition , 2014, Pattern Recognit..

[23]  Adnan Darwiche,et al.  A differential approach to inference in Bayesian networks , 2000, JACM.

[24]  Gabriel Thomas,et al.  Human Activity Recognition using Binary Motion Image and Deep Learning , 2015, Procedia Computer Science.

[25]  Xiaogang Wang,et al.  A Deep Sum-Product Architecture for Robust Facial Attributes Analysis , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Yang Wang,et al.  Unsupervised Discovery of Action Classes , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[27]  Gang Wang,et al.  Learning Contextual Dependence With Convolutional Hierarchical Recurrent Neural Networks , 2015, IEEE Transactions on Image Processing.

[28]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Cordelia Schmid,et al.  Expanded Parts Model for Human Attribute and Action Recognition in Still Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[31]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[32]  Ivan Laptev,et al.  Learning person-object interactions for action recognition in still images , 2011, NIPS.

[33]  Yoshua Bengio,et al.  Shallow vs. Deep Sum-Product Networks , 2011, NIPS.

[34]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[35]  Daniel Lowd,et al.  Learning Sum-Product Networks with Direct and Indirect Variable Interactions , 2014, ICML.

[36]  Chunheng Wang,et al.  Cross-View Action Recognition Using Contextual Maximum Margin Clustering , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[37]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[38]  Dong Xu,et al.  Action Recognition Using Multilevel Features and Latent Structural SVM , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[39]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[40]  Pedro M. Domingos,et al.  Discriminative Learning of Sum-Product Networks , 2012, NIPS.

[41]  Tian-Tsong Ng,et al.  Multimodal Multipart Learning for Action Recognition in Depth Videos , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Gang Wang,et al.  Learning Contextual Dependencies with Convolutional Hierarchical Recurrent Neural Networks , 2015, ArXiv.

[43]  Gang Wang,et al.  Video Tracking Using Learned Hierarchical Features , 2015, IEEE Transactions on Image Processing.

[44]  Fei-Fei Li,et al.  Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[45]  Alessia Saggese,et al.  Exploiting the deep learning paradigm for recognizing human actions , 2014, 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[46]  Nazli Ikizler-Cinbis,et al.  Learning actions from the Web , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[47]  Jenq-Neng Hwang,et al.  A Review on Video-Based Human Activity Recognition , 2013, Comput..

[48]  Václav Hlavác,et al.  Pose primitive based human action recognition in videos or still images , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.