Probability-based method for boosting human action recognition using scene context

In this study, the authors investigate whether action recognition performance can be boosted by exploiting the associated scene context. To this end, the authors model the scene as a middle layer that bridges action descriptors and action categories. This is achieved via a scene topic model, in which hybrid visual descriptors, comprising spatio-temporal action features and scene descriptors, are first extracted from a video sequence. The authors then learn a joint probability distribution over scene and action using a naive Bayes nearest neighbour algorithm, and use it to jointly infer the action categories online in combination with off-the-shelf action recognition algorithms. The authors demonstrate the advantages of their approach by comparing it with state-of-the-art approaches on several action recognition benchmarks.
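
The inference idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax used to turn naive Bayes nearest neighbour (NBNN) scores into scene probabilities, the conditional table `p_action_given_scene`, and the toy data are all assumptions introduced for illustration only.

```python
import numpy as np

def nbnn_scores(query, class_descriptors):
    """NBNN score per class: minus the summed squared distance from each
    query descriptor to its nearest neighbour among that class's descriptors."""
    scores = []
    for descs in class_descriptors:
        # Pairwise squared Euclidean distances, shape (n_query, n_class_descs).
        d2 = ((query[:, None, :] - descs[None, :, :]) ** 2).sum(axis=2)
        scores.append(-d2.min(axis=1).sum())
    return np.array(scores)

def softmax(x):
    # Numerically stable softmax (assumed here as a way to normalise scores).
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_with_scene(action_probs, scene_probs, p_action_given_scene):
    """Reweight baseline action probabilities (A,) by scene context.
    scene_probs has shape (S,), p_action_given_scene has shape (S, A)."""
    context = scene_probs @ p_action_given_scene  # expected P(action | scene)
    fused = action_probs * context
    return fused / fused.sum()

# Hypothetical toy data: two scene classes with 2-D scene descriptors.
scene_train = [np.array([[0.0, 0.0], [0.0, 1.0]]),
               np.array([[5.0, 5.0], [6.0, 5.0]])]
query = np.array([[0.0, 0.1], [0.2, 0.9]])       # descriptors of a test video
action_probs = np.array([0.45, 0.55])            # output of a baseline recogniser
p_action_given_scene = np.array([[0.8, 0.2],     # assumed learned from training data
                                 [0.1, 0.9]])

scene_probs = softmax(nbnn_scores(query, scene_train))
fused = fuse_with_scene(action_probs, scene_probs, p_action_given_scene)
print(scene_probs, fused)
```

In this toy case the baseline recogniser alone slightly favours the second action, whereas the scene-aware estimate favours the first, because the query descriptors lie close to the first scene class and that scene is assumed to co-occur mostly with the first action.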
