Combining appearance and motion for human action classification in videos

An important cue for high-level scene understanding is the analysis of the objects in a scene together with their behavior and interactions. In this paper we study the classification of activities in videos, an integral component of any scene understanding system, and present a novel approach for recognizing human action categories by combining appearance and motion information from human body parts. Our approach tracks human body parts using mixture particle filters and then clusters the particles with local non-parametric clustering, associating a local set of particles with each cluster mode. The trajectories of these cluster modes provide the “motion” information, while the “appearance” information comes from statistics of the relative motion of each local particle set over a number of frames. We then use a “bag of words” model to build one histogram per video sequence from this set of robust appearance and motion descriptors. These histograms capture characteristic information that discriminates among human actions and thereby supports understanding of the complete scene. We evaluated our approach on the standard KTH and Weizmann human action datasets, obtaining results comparable to state-of-the-art methods. Additionally, our approach distinguishes activities involving motion of the complete body from those in which only certain body parts move; in other words, it discriminates well between activities with “global body motion”, such as running and jogging, and activities with “local motion”, such as waving and boxing.
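The paper's specific descriptors and codebook are not reproduced here, but the final histogram-building step of the “bag of words” model can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_bow_histogram` is a hypothetical helper, the codebook is assumed to have been learned beforehand (e.g. by k-means over training descriptors), and random vectors stand in for the appearance/motion descriptors of one video.

```python
import numpy as np

def build_bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword (Euclidean distance)
    and return an L1-normalized histogram of codeword counts."""
    # Pairwise distances, shape (n_descriptors, n_codewords), via broadcasting.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)                  # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                      # one normalized histogram per video

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 2))   # hypothetical codebook: 4 words, 2-D descriptors
descriptors = rng.normal(size=(10, 2))  # stand-in for one video's descriptors
h = build_bow_histogram(descriptors, codebook)
```

In the paper's setting, one such histogram per video sequence would then serve as the fixed-length feature vector for a classifier (e.g. an SVM) over action categories.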
