Learning a discriminative mid-level feature for action recognition

In this paper, we address the problem of recognizing human actions from videos. Most of the existing approaches employ low-level features (e.g., local features and global features) to represent an action video. However, algorithms based on low-level features are not robust to complex environments such as cluttered background, camera movement and illumination change. Therefore, we propose a novel random forest learning framework to construct a discriminative and informative mid-level feature from low-level features of densely sampled 3D cuboids. Each cuboid is classified by the corresponding random forests with a novel fusion scheme, and the cuboid’s posterior probabilities of all categories are normalized to generate a histogram. After that, we obtain our mid-level feature by concatenating histograms of all the cuboids. Since a single low-level feature is not enough to capture the variations of human actions, multiple complementary low-level features (i.e., optical flow and histogram of gradient 3D features) are employed to describe 3D cuboids. Moreover, temporal context between local cuboids is exploited as another type of low-level feature. The above three low-level features (i.e., optical flow, histogram of gradient 3D features and temporal context) are effectively fused in the proposed learning framework. Finally, the mid-level feature is employed by a random forest classifier for robust action recognition. Experiments on the Weizmann, UCF sports, Ballet, and multi-view IXMAS datasets demonstrate that out mid-level feature learned from multiple low-level features can achieve a superior performance over state-of-the-art methods.

[1]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[2]  Leo Breiman,et al.  Randomizing Outputs to Increase Prediction Accuracy , 2000, Machine Learning.

[3]  Gang Yu,et al.  Unsupervised random forest indexing for fast action search , 2011, CVPR 2011.

[4]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Silvio Savarese,et al.  Cross-view action recognition via view knowledge transfer , 2011, CVPR 2011.

[6]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  Andrew Zisserman,et al.  Image Classification using Random Forests and Ferns , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[8]  Vincent Lepetit,et al.  Keypoint recognition using randomized trees , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Zhiwu Lu,et al.  Spectral learning of latent semantics for action recognition , 2011, 2011 International Conference on Computer Vision.

[10]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Greg Mori,et al.  Action recognition by learning mid-level motion features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Václav Hlavác,et al.  Pose primitive based human action recognition in videos or still images , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Patrick Pérez,et al.  Cross-View Action Recognition from Temporal Self-similarities , 2008, ECCV.

[14]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[15]  Mubarak Shah,et al.  Recognizing human actions using multiple features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Juan Carlos Niebles,et al.  A Hierarchical Model of Shape and Appearance for Human Action Classification , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[19]  Yang Wang,et al.  Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Rémi Ronfard,et al.  Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[21]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[22]  Xiaoqin Zhang,et al.  Adaptive learning codebook for action recognition , 2011, Pattern Recognit. Lett..

[23]  Greg Mori,et al.  Max-margin hidden conditional random fields for human action recognition , 2009, CVPR.

[24]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[25]  Pascal Fua,et al.  Making Action Recognition Robust to Occlusions and Viewpoint Changes , 2010, ECCV.

[26]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[27]  Wei Liang,et al.  Discriminative human action recognition in the learned hierarchical manifold space , 2010, Image Vis. Comput..

[28]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[29]  Dong Xu,et al.  Action recognition using context and appearance distribution features , 2011, CVPR 2011.

[30]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[31]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Wei Liang,et al.  Incremental discriminant-analysis of canonical correlations for action recognition , 2010, Pattern Recognit..

[33]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[34]  Luc Van Gool,et al.  A Hough transform-based voting framework for action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.