Spatio-Temporal Analysis for Human Action Detection and Recognition in Uncontrolled Environments

Understanding the semantic meaning of human actions captured in unconstrained environments has broad applications in fields ranging from patient monitoring and human-computer interaction to surveillance systems. However, while great progress has been achieved on automatic human action detection and recognition in videos captured in controlled (constrained) environments, most existing approaches perform unsatisfactorily on videos captured under uncontrolled (unconstrained) conditions, e.g., significant camera motion, background clutter, scale changes, and varying lighting conditions. To address this issue, the authors propose a robust human action detection and recognition framework that works effectively on videos taken in both controlled and uncontrolled environments. Specifically, the authors integrate the optical flow field and the Harris3D corner detector to generate a new spatial-temporal information representation for each video sequence, from which a general Gaussian mixture model (GMM) is learned. The mean vectors of all Gaussian components in the learned GMM are concatenated to create the GMM supervector used for video action recognition. To improve recognition accuracy, the authors build a boosting classifier from a set of sparse representation classifiers and Hamming distance classifiers. Experimental results on two widely used public data sets, KTH and UCF YouTube Action, show that the proposed framework outperforms other state-of-the-art approaches on both action detection and recognition.
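The supervector step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes diagonal covariances, a simple deterministic initialization, and synthetic 2-D points standing in for the optical-flow/Harris3D descriptors. The names fit_diag_gmm and gmm_supervector are hypothetical.

```python
import math
import random


def fit_diag_gmm(X, k, n_iter=50, eps=1e-6):
    """Fit a diagonal-covariance GMM to the rows of X with plain EM.

    Assumption: initial means are k evenly spaced rows of X (illustrative only).
    """
    n, d = len(X), len(X[0])
    means = [list(X[(j * n) // k]) for j in range(k)]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        resp = []
        for x in X:
            logp = []
            for j in range(k):
                ll = math.log(weights[j])
                for t in range(d):
                    var = variances[j][t]
                    ll -= 0.5 * (math.log(2 * math.pi * var)
                                 + (x[t] - means[j][t]) ** 2 / var)
                logp.append(ll)
            m = max(logp)  # subtract the max for numerical stability
            p = [math.exp(v - m) for v in logp]
            s = sum(p)
            resp.append([v / s for v in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + eps
            weights[j] = nj / n
            for t in range(d):
                mu = sum(r[j] * x[t] for r, x in zip(resp, X)) / nj
                var = sum(r[j] * (x[t] - mu) ** 2 for r, x in zip(resp, X)) / nj
                means[j][t] = mu
                variances[j][t] = max(var, eps)
    return weights, means, variances


def gmm_supervector(means):
    """Concatenate the component mean vectors into one flat vector."""
    return [v for mu in means for v in mu]


# Synthetic stand-in for one video's descriptors: two 2-D clusters.
rng = random.Random(0)
descriptors = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
descriptors += [[rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)] for _ in range(200)]

weights, means, variances = fit_diag_gmm(descriptors, k=2)
supervector = gmm_supervector(means)
print(len(supervector))  # k * d entries
```

Because EM assigns component indices arbitrarily, supervectors from independently fitted GMMs are comparable only if component order is kept consistent across videos. One common option is to adapt every video's model from a shared background GMM, though the abstract does not say how the authors handle this.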
