Predicting human activities using spatio-temporal structure of interest points

Early recognition and prediction of human activities are of great importance in video surveillance: by recognizing a criminal activity at its beginning stage, for example, it may be possible to avoid unfortunate outcomes. We address early activity recognition by developing a Spatial-Temporal Implicit Shape Model (STISM), which characterizes the space-time structure of the sparse local features extracted from a video. Early recognition of human activities is then accomplished by pattern matching through STISM. To enable efficient and robust matching, we propose a new random forest structure, called the multi-class balanced random forest, which trades off the balance of the trees against their discriminative ability. Prediction is performed simultaneously for multiple classes, which saves both memory and computational cost. Experiments show that our algorithm significantly outperforms the state of the art on the human activity prediction problem.
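
To make the matching idea concrete, the following is a minimal sketch of generic implicit-shape-model voting over space-time features, scored for all classes in a single pass. It is not the STISM or the multi-class balanced random forest described above: the codeword lookup is a plain dictionary standing in for the forest, and the names `train`, `predict`, `bin_size`, and the toy feature tuples are assumptions made only for illustration.

```python
from collections import defaultdict


def train(videos):
    """Build a codeword -> [(label, offset)] table.

    videos: list of (label, center, features), where center is an (x, y, t)
    reference point and features is a list of (codeword, (x, y, t)).
    The dictionary is a hypothetical stand-in for a learned forest.
    """
    codebook = defaultdict(list)
    for label, center, features in videos:
        for codeword, pos in features:
            # Offset from the feature to the activity's reference point.
            offset = tuple(c - p for c, p in zip(center, pos))
            codebook[codeword].append((label, offset))
    return codebook


def predict(codebook, features, bin_size=10):
    """Vote for (label, center bin) from a partial observation.

    All classes share one vote table, so a single pass over the observed
    features scores every class at once. Returns None if nothing matches.
    """
    votes = defaultdict(float)
    for codeword, pos in features:
        entries = codebook.get(codeword, [])
        for label, offset in entries:
            # Predicted reference point, quantized into a coarse bin.
            center = tuple((p + o) // bin_size for p, o in zip(pos, offset))
            # Each feature distributes one unit of weight over its entries.
            votes[(label, center)] += 1.0 / len(entries)
    if not votes:
        return None
    (label, _), _ = max(votes.items(), key=lambda kv: kv[1])
    return label


# Toy training data (illustrative values only, not from any dataset).
training = [
    ("punch", (50, 50, 30), [(1, (40, 50, 5)), (2, (45, 50, 15)), (3, (60, 50, 28))]),
    ("handshake", (50, 50, 30), [(1, (40, 50, 5)), (4, (48, 52, 15)), (5, (52, 50, 28))]),
]
codebook = train(training)

# Only the early part of a spatially shifted activity has been observed.
early_observation = [(1, (140, 50, 5)), (2, (145, 50, 15))]
prediction = predict(codebook, early_observation)
```

In this toy setting the first observed feature is ambiguous between the two classes, while the second agrees with only one of them on the location of the reference point, so the votes concentrate on a single class before the activity has finished.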
