A distribution based video representation for human action recognition

Most current research on human action recognition in videos uses the bag-of-words (BoW) representations based on vector quantization on local spatial temporal features, due to the simplicity and good performance of such representations. In contrast to the BoW schemes, this paper explores a localized, continuous and probabilistic video representation. Specifically, the proposed representation encodes the visual and motion information of an ensemble of local spatial temporal (ST) features of a video into a distribution estimated by a generative probabilistic model such as the Gaussian Mixture Model. Furthermore, this probabilistic video representation naturally gives rise to an information-theoretic distance metric of videos. This makes the representation readily applicable as input to most discriminative classifiers, such as the nearest neighbor schemes and the kernel methods. The experiments on two datasets, KTH and UCF sports, show that the proposed approach could deliver promising results.

[1]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[2]  Hayit Greenspan,et al.  A Continuous Probabilistic Framework for Image Matching , 2001, Comput. Vis. Image Underst..

[3]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[4]  Mubarak Shah,et al.  Recognizing human actions , 2005, VSSN@MM.

[5]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[6]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[7]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[8]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Zhenmin Tang,et al.  Speaker identification using multi-step clustering algorithm with transformation-based GMM , 2007, Automatic Control and Computer Sciences.

[10]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  John R. Hershey,et al.  Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[12]  Solomon Kullback,et al.  Information Theory and Statistics , 1970, The Mathematical Gazette.

[13]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[14]  Shuicheng Yan,et al.  SIFT-Bag kernel for video event analysis , 2008, ACM Multimedia.

[15]  Liang Zhao,et al.  Stereo- and neural network-based pedestrian detection , 1999, Proceedings 199 IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems (Cat. No.99TH8383).

[16]  Jiebo Luo,et al.  Action recognition in unconstrained amateur videos , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Regunathan Radhakrishnan,et al.  Effective and efficient sports highlights extraction using the minimum description length criterion in selecting GMM structures , 2004, ICME.

[19]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[21]  S. Kullback,et al.  Information Theory and Statistics , 1959 .

[22]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[23]  Solomon Kullback,et al.  Information Theory and Statistics , 1960 .

[24]  Alberto Del Bimbo,et al.  Effective Codebooks for human action categorization , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.