Action detection using multiple spatial-temporal interest point features

This paper considers the problem of detecting actions in cluttered videos. Compared with the classical action recognition problem, we aim to estimate not only the category of a given video sequence but also the spatial-temporal locations of the action instances. In recent years, many feature extraction schemes have been designed to describe various aspects of actions. However, because of the difficulties of action detection, e.g., cluttered backgrounds and potential occlusions, no single type of feature can solve the action detection problem perfectly in cluttered videos. In this paper, we attack the detection problem by combining multiple Spatial-Temporal Interest Point (STIP) features, which detect salient patches in the video domain and describe these patches with features of local regions. The difficulty of combining multiple STIP features for action detection is twofold. First, different STIP methods detect different numbers of salient patches; how to combine such features is not considered by existing fusion methods [13], [5]. Second, detection in videos must be efficient, which rules out many slow machine learning algorithms. To handle these two difficulties, we propose a new approach that combines a Gaussian Mixture Model with Branch-and-Bound search to efficiently locate the action of interest. We build a new challenging dataset for our action detection task, on which our algorithm obtains impressive results. On the classical KTH dataset, our method outperforms the state-of-the-art methods.
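The core idea, scoring video regions with a probabilistic model and localizing the best-scoring region by Branch-and-Bound, can be illustrated with a simplified sketch. The snippet below is not the paper's implementation: it assumes each frame already carries a detection score (e.g., a GMM log-likelihood ratio of the STIP descriptors it contains, with the function name `localize_1d` and the scoring scheme being illustrative assumptions) and finds the highest-scoring frame interval, a 1-D analogue of efficient subwindow/subvolume search in the spirit of [11], [15].

```python
import heapq
import itertools

def localize_1d(scores):
    """Branch-and-bound search for the maximum-sum frame interval.

    Each search state is a range of possible start frames [s_lo, s_hi]
    and a range of possible end frames [e_lo, e_hi].  The bound adds
    every positive score the interval could cover (largest interval)
    plus every negative score it must cover (smallest interval), so it
    never underestimates the best interval in the state.
    """
    n = len(scores)
    # Prefix sums of the positive and negative parts of the scores.
    pos, neg = [0.0], [0.0]
    for s in scores:
        pos.append(pos[-1] + max(s, 0.0))
        neg.append(neg[-1] + min(s, 0.0))

    def bound(s_lo, s_hi, e_lo, e_hi):
        up = pos[e_hi + 1] - pos[s_lo]       # best-case positives
        if s_hi <= e_lo:                     # smallest interval is non-empty
            up += neg[e_lo + 1] - neg[s_hi]  # unavoidable negatives
        return up

    tie = itertools.count()  # tie-breaker so the heap never compares tuples
    heap = [(-bound(0, n - 1, 0, n - 1), next(tie), 0, n - 1, 0, n - 1)]
    while heap:
        b, _, s_lo, s_hi, e_lo, e_hi = heapq.heappop(heap)
        if s_lo == s_hi and e_lo == e_hi:
            # Singleton state: the bound equals the exact interval score.
            return (s_lo, e_lo, -b)
        # Split the wider of the two ranges in half.
        if s_hi - s_lo >= e_hi - e_lo:
            m = (s_lo + s_hi) // 2
            parts = [(s_lo, m, e_lo, e_hi), (m + 1, s_hi, e_lo, e_hi)]
        else:
            m = (e_lo + e_hi) // 2
            parts = [(s_lo, s_hi, e_lo, m), (s_lo, s_hi, m + 1, e_hi)]
        for p in parts:
            if p[0] <= p[3]:  # a start can still precede an end
                heapq.heappush(heap, (-bound(*p), next(tie), *p))
```

The priority queue always expands the state with the largest bound first, so the search returns the globally optimal interval while typically examining far fewer candidates than exhaustive enumeration; the paper's actual method searches spatial-temporal subvolumes rather than 1-D intervals.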

[1] Serge J. Belongie, et al. Behavior recognition via sparse spatio-temporal features, 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[2] Thomas Serre, et al. A Biologically Inspired System for Action Recognition, 2007, 2007 IEEE 11th International Conference on Computer Vision.

[3] Hans Jørgen Andersen, et al. British Machine Vision Conference 2006, 2006.

[4] Juan Carlos Niebles, et al. Unsupervised Learning of Human Action Categories, 2006.

[5] Ronen Basri, et al. Actions as Space-Time Shapes, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Mubarak Shah, et al. Recognizing human actions using multiple features, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[7] Yihong Gong, et al. Action detection in complex scenes with spatial and temporal ambiguities, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[8] Ivan Laptev, et al. On Space-Time Interest Points, 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[9] James M. Rehg, et al. A Scalable Approach to Activity Recognition based on Object Use, 2007, 2007 IEEE 11th International Conference on Computer Vision.

[10] Barbara Caputo, et al. Recognizing human actions: a local SVM approach, 2004, Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004.

[11] Ying Wu, et al. Discriminative subvolume search for efficient action detection, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[12] Michal Irani, et al. Detecting Irregularities in Images and in Video, 2005, ICCV.

[13] Mubarak Shah, et al. Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14] David Elliott, et al. In the Wild, 2010.

[15] Christoph H. Lampert, et al. Beyond sliding windows: Object localization by efficient subwindow search, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Jiebo Luo, et al. Heterogeneous feature machines for visual recognition, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[17] Douglas A. Reynolds, et al. Speaker Verification Using Adapted Gaussian Mixture Models, 2000, Digit. Signal Process.

[18] Zicheng Liu, et al. Hierarchical Filtered Motion for Action Recognition in Crowded Videos, 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[19] Tae-Kyun Kim, et al. Learning Motion Categories using both Semantic and Structural Information, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[20] Cordelia Schmid, et al. Learning realistic human actions from movies, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21] Ming Yang, et al. Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor, 2009, ACM Multimedia.

[22] Jiebo Luo, et al. Recognizing realistic actions from videos "in the wild", 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[23] Christoph H. Lampert, et al. Learning to Localize Objects with Structured Output Regression, 2008, ECCV.

[24] Martial Hebert, et al. Event Detection in Crowded Videos, 2007, 2007 IEEE 11th International Conference on Computer Vision.

[25] Benjamin Z. Yao, et al. Learning deformable action templates from cluttered videos, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[26] Zicheng Liu, et al. Cross-dataset action detection, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[27] James M. Rehg, et al. Quasi-periodic event analysis for social game retrieval, 2009, 2009 IEEE 12th International Conference on Computer Vision.