Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

Imagine a video taken on a sunny beach: can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking, and people lying on the beach? Automatically classifying and localizing actions in video sequences is useful for a variety of tasks, such as video surveillance, object-level video summarization, video indexing, and digital library organization. However, robust action recognition remains a challenging task for computers due to cluttered backgrounds, camera motion, occlusion, and geometric and photometric variation of objects. For example, in a live video of a skating competition, the skater moves rapidly across the rink while the camera moves to follow the skater. With a moving camera, a non-stationary background, and a moving target, few vision algorithms can identify, categorize, and localize such motions well. The challenge is even greater when a complex video sequence contains multiple activities (Figure 1).

We present a video demo of our novel unsupervised learning method for human action categories [3]. A video sequence is represented as a collection of spatial-temporal words obtained by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and of intermediate topics corresponding to human action categories using a probabilistic Latent Semantic Analysis (pLSA) model [1]. The learned model is then used to categorize and localize human actions in a novel video by maximizing the posterior over action category (topic) distributions. The contributions of this work are as follows:

• Unsupervised learning of actions using a 'video words' representation. We deploy a pLSA model with a 'bag of video words' representation for video analysis.

• Multiple action localization and categorization. Our approach is not only able to classify different actions, but also to localize different actions simultaneously in a novel and complex video sequence.
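The learning and recognition steps described above can be sketched as a standard pLSA EM loop over a video-by-word count matrix, where each row counts the quantized spatial-temporal words of one video. This is a minimal illustrative sketch, not the authors' implementation: all function and variable names are hypothetical, and the count matrix stands in for the output of space-time interest point extraction and codebook quantization.

```python
import numpy as np

def plsa_fit(counts, n_topics, n_iter=100, seed=0):
    """Fit pLSA by EM. counts[d, w] = occurrences of spatial-temporal
    word w in training video d. Returns P(z|d) and P(w|z).
    (Illustrative sketch, not the authors' code.)"""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))       # P(z|d), topic mixture per video
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))      # P(w|z), word distribution per topic
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (docs, topics, words)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: reweight by observed counts n(d,w)
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def plsa_classify(counts_new, p_w_z, n_iter=50):
    """Fold in novel videos: hold the learned P(w|z) fixed, estimate
    P(z|d_new) by EM, and label each video with its maximum-posterior
    topic, i.e. the inferred action category."""
    n_docs, n_topics = counts_new.shape[0], p_w_z.shape[0]
    p_z_d = np.full((n_docs, n_topics), 1.0 / n_topics)
    for _ in range(n_iter):
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        p_z_d = (counts_new[:, None, :] * post).sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d.argmax(axis=1)
```

Because P(z|d) is estimated per video, the same fold-in can be applied to sub-windows of one sequence, which is what makes simultaneous localization of multiple actions possible in this framework.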

[1] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. SIGIR, 1999.

[2] B. Caputo et al., "Recognizing human actions: a local SVM approach," in Proc. ICPR, 2004.

[3] J. C. Niebles et al., "Unsupervised learning of human action categories," 2006.

[4] Y. Wang et al., "Unsupervised discovery of action classes," in Proc. CVPR, 2006.