Spatio-Temporal Feature Evaluation for Action Recognition

Spatio-temporal interest points are the most popular feature representation in the field of action recognition. A variety of methods have been proposed to detect and describe local patches in video, with several techniques reporting state-of-the-art performance for action recognition. However, the reported results are obtained under different experimental settings with different datasets, making it difficult to compare the various approaches. We therefore comprehensively evaluate state-of-the-art spatio-temporal features under a common evaluation framework, using popular benchmark datasets (KTH, Weizmann) and more challenging datasets such as Hollywood2. The purpose of this work is to provide guidance for researchers selecting features for different applications with different environmental conditions. We evaluate four popular descriptors (HOG, HOF, HOG/HOF, HOG3D) using a popular bag of visual features representation and Support Vector Machines (SVM) for classification. Moreover, we provide an in-depth analysis of local feature descriptors and optimize the codebook sizes for different datasets with different descriptors. We demonstrate that motion-based features offer better performance than those that rely solely on spatial information, while features that combine both types of data are more consistent across a variety of conditions but typically require a larger codebook for optimal performance.
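The bag of visual features step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_codebook`, `assign_words`, `encode_video`) are hypothetical, the descriptors are random vectors standing in for real HOG, HOF, HOG/HOF or HOG3D output, and clustering uses a plain Lloyd's k-means as an assumed choice. The parameter `k` plays the role of the codebook size that is tuned per dataset and descriptor.

```python
import numpy as np


def assign_words(descriptors, centres):
    """Return the index of the nearest codebook centre for each descriptor."""
    # Squared Euclidean distances, shape (n_descriptors, n_words).
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Cluster (n, d) descriptors into k visual words with Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    # Initialise the centres from k distinct training descriptors.
    start = rng.choice(len(descriptors), size=k, replace=False)
    centres = descriptors[start].astype(float)
    for _ in range(n_iter):
        labels = assign_words(descriptors, centres)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # an empty cluster keeps its previous centre
                centres[j] = members.mean(axis=0)
    return centres


def encode_video(descriptors, centres):
    """Encode one video as an L1-normalised histogram of visual words."""
    labels = assign_words(descriptors, centres)
    hist = np.bincount(labels, minlength=len(centres)).astype(float)
    return hist / max(hist.sum(), 1.0)


# Synthetic stand-ins for local descriptors (assumption: 16-D vectors).
rng = np.random.default_rng(1)
train_desc = rng.normal(size=(500, 16))
codebook = build_codebook(train_desc, k=8)
video_hist = encode_video(rng.normal(size=(40, 16)), codebook)
```

Each video becomes a fixed-length histogram regardless of how many interest points it contains, which is what allows a standard classifier such as an SVM to be trained on the result. Repeating the encoding for several values of `k` and comparing classification accuracy is one simple way to search for a suitable codebook size.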
