Kernel-based Recognition of Human Actions Using Spatiotemporal Salient Points

This paper addresses the problem of human action recognition by introducing a sparse representation of image sequences as a collection of spatiotemporal events, localized at points that are salient in both space and time. We detect these spatiotemporal salient points by measuring variations in the information content of pixel neighborhoods in space as well as in time. We derive a distance measure between two such representations, based on the Chamfer distance, and optimize it with respect to a number of temporal and scaling parameters. In this way we achieve invariance to scaling while also eliminating the temporal differences between the representations. We address the classification problem with Relevance Vector Machines (RVM), for which we propose new kernels specifically tailored to the spatiotemporal salient point representation. These kernels are built on the optimized Chamfer distance of the previous step. We present results on real image sequences from a small database depicting people performing 19 aerobic exercises.
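
To make the distance and kernel idea concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes each sequence is reduced to a set of (x, y, t) salient points, uses a symmetric mean nearest-neighbour Chamfer distance, and replaces the paper's optimization with a brute-force grid search over one spatial scale and one temporal shift. The function names, the parameter `eta`, and the Gaussian form of the kernel are all hypothetical choices made for illustration.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty sets of (x, y, t) points."""
    def directed(src, dst):
        # Mean distance from each point in src to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (directed(a, b) + directed(b, a))

def optimized_chamfer(a, b, shifts, scales):
    """Smallest Chamfer distance over a grid of temporal shifts and spatial scales.

    Assumption: only b is transformed, by scaling x and y and shifting t.
    """
    best = math.inf
    for s in scales:
        for dt in shifts:
            moved = [(s * x, s * y, t + dt) for (x, y, t) in b]
            best = min(best, chamfer_distance(a, moved))
    return best

def chamfer_kernel(a, b, shifts, scales, eta=1.0):
    """Gaussian-style kernel on the optimized distance (hypothetical form)."""
    d = optimized_chamfer(a, b, shifts, scales)
    return math.exp(-d * d / eta)

# Two toy "sequences": the second is the first delayed by one frame.
seq_a = [(0, 0, 0), (1, 0, 0)]
seq_b = [(0, 0, 1), (1, 0, 1)]
k = chamfer_kernel(seq_a, seq_b, shifts=[-1, 0, 1], scales=[1.0])
```

In this toy case the grid search finds the shift that cancels the one-frame delay, so the two sequences are treated as identical. A kernel built this way is not guaranteed to be positive definite, which is one reason an RVM is a natural fit: unlike a support vector machine, it does not require Mercer kernels.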
