Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition

Abstract The detection of spatio-temporal interest points plays a key role in human action recognition algorithms. This work builds on the strengths of the bag-of-visual-features approach and presents a method for automatic action recognition in realistic and complex scenarios. It improves the feature representation by combining a well-known feature detector with a well-known descriptor: the 3D Harris space-time interest point detector and the 3D Scale-Invariant Feature Transform (3D SIFT) descriptor. Action videos are then represented as histograms of visual features, following the traditional bag-of-visual-features approach, and a support vector machine (SVM) classifier is used for training and testing. Extensive experiments on existing benchmark datasets demonstrate the effectiveness of the method, which shows state-of-the-art performance: 68.1% mean Average Precision (mAP) on Hollywood-2, and average accuracies of 94% and 91.8% on UCF Sports and KTH, respectively.
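
The histogram representation named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook is assumed to be given already (in practice it is typically learned by clustering training descriptors, for example with k-means), toy 2-D vectors stand in for real 3D SIFT descriptors, and the function names `nearest_codeword` and `bovf_histogram` are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))


def bovf_histogram(descriptors, codebook):
    """Build an L1-normalised histogram of codeword assignments for one video."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    # A video with no detected interest points maps to an all-zero histogram.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Assumed toy data: 2-D vectors in place of high-dimensional 3D SIFT descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
video_descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (3.9, 4.2)]

hist = bovf_histogram(video_descriptors, codebook)
print(hist)
```

Each video yields one fixed-length vector regardless of how many interest points were detected, which is what allows a standard classifier such as an SVM to be trained on the histograms.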