Human action recognition using weighted pooling

Pooling strategies such as max pooling and sum pooling have been widely used to obtain global representations of action videos. However, these strategies have several drawbacks. First, they are easily affected by unwanted background local features, by the absence of discriminative local features, and by the number of times an actor repeats a periodic action. Second, most pooling strategies build the global representation from local features alone and therefore capture little mid-level information for action representation. In this study, the authors propose a novel weighted pooling strategy based on an actionlet representation for action recognition. Actionlets are defined as the movements of large body parts such as the legs, arms and head, and they capture rich mid-level features for action representation. In addition, the authors' method incorporates the distribution information of actionlets into the pooling procedure. Specifically, a pooling weight, which determines the importance of an actionlet to the final video representation, is assigned to each actionlet. To learn these weights, the authors propose a novel discriminative learning algorithm that captures discriminative information for the pooling operation. They evaluate the weighted pooling on three datasets: the KTH actions dataset, the UCF sports dataset and the YouTube actions dataset. Experimental results show the effectiveness of the proposed method.
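
To make the idea of weighted pooling concrete, the following is a minimal sketch of pooling per-actionlet descriptors with per-actionlet weights. It assumes a simple normalised weighted-sum form; the function names, the normalisation steps and the weighted-sum formulation are illustrative assumptions, not the authors' exact method or their discriminative weight-learning algorithm.

```python
# Illustrative sketch only: weighted pooling of actionlet descriptors into a
# single video-level representation. The weighted-sum form and all names here
# are assumptions, not the authors' published formulation.
import numpy as np

def weighted_pool(actionlet_descriptors: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Aggregate per-actionlet descriptors into one video representation.

    actionlet_descriptors: (n_actionlets, dim) array of mid-level features.
    weights: (n_actionlets,) pooling weights, e.g. learned discriminatively.
    """
    w = np.maximum(weights, 0.0)              # keep weights non-negative
    w = w / (w.sum() + 1e-8)                  # normalise to a convex combination
    video_repr = w @ actionlet_descriptors    # weighted sum replaces plain sum/max pooling
    return video_repr / (np.linalg.norm(video_repr) + 1e-8)  # L2-normalise for a classifier

# Usage: pool 20 actionlet descriptors of dimension 128.
# Uniform weights reduce to average pooling; learned weights would instead
# emphasise the actionlets that are most discriminative for the action class.
descriptors = np.random.rand(20, 128)
uniform_weights = np.ones(20)
representation = weighted_pool(descriptors, uniform_weights)
```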
