Spatio-temporal layout of human actions for improved bag-of-words action detection

We investigate how human action recognition can be improved by considering spatio-temporal layout of actions. From literature, we adopt a pipeline consisting of STIP features, a random forest to quantize the features into histograms, and an SVM classifier. Our goal is to detect 48 human actions, ranging from simple actions such as walk to complex actions such as exchange. Our contribution to improve the performance of this pipeline by exploiting a novel spatio-temporal layout of the 48 actions. Here each STIP feature does not in the video contributes to the histogram bins by a unity value, but rather by a weight given by its spatio-temporal probability. We propose 6 configurations of spatio-temporal layout, where the varied parameters are the coordinate system and the modeling of the action and its context. Our model of layout does not change any other parameter of the pipeline, it requires no re-learning of the random forest, yields a limited increase of the size of its resulting representation by only a factor two, and at a minimal additional computational cost of only a handful of operations per feature. Extensive experiments show that the layout is demonstrated to be distinctive of actions that involve trajectories, (dis)appearance, kinematics, and interactions. The visualization of each action's layout illustrates that our approach is indeed able to model spatio-temporal patterns of each action. Each layout is experimentally shown to be optimal for a specific set of actions. Generally, the context has more effect than the choice of coordinate system. The most impressive improvements are achieved for complex actions involving items. For 43 out of 48 human actions, the performance is better or equal when spatio-temporal layout is included. In addition, we show our method outperforms state-of-the-art for the IXMAS and UT-Interaction datasets.

[1]  Christoph H. Lampert,et al.  Beyond sliding windows: Object localization by efficient subwindow search , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[3]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[4]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[6]  Jake K. Aggarwal,et al.  An Overview of Contest on Semantic Description of Human Activities (SDHA) 2010 , 2010, ICPR Contests.

[7]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Henri Bouma,et al.  Automatic human action recognition in a scene from visual inputs , 2012, Defense + Commercial Sensing.

[9]  Dong Xu,et al.  Action recognition using context and appearance distribution features , 2011, CVPR 2011.

[10]  Klamer Schutte,et al.  Correlations between 48 human actions improve their detection , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[11]  Cordelia Schmid,et al.  TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[12]  Klamer Schutte,et al.  Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos , 2013, Machine Vision and Applications.

[13]  Luc Van Gool,et al.  Variations of a Hough-Voting Action Recognition System , 2010, ICPR Contests.

[14]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[15]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[16]  Rémi Ronfard,et al.  Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[17]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Andrew Zisserman,et al.  Image Classification using Random Forests and Ferns , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[19]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[20]  Frédéric Jurie,et al.  Modeling spatial layout with fisher vectors for image categorization , 2011, 2011 International Conference on Computer Vision.

[21]  David Elliott,et al.  In the Wild , 2010 .

[22]  Trevor Darrell,et al.  The pyramid match kernel: discriminative classification with sets of image features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[23]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[24]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.