Unsupervised Learning of Activities in Video Using Scene Context

Unsupervised learning of semantic activities from video collected over time is an important problem for visual surveillance and video scene understanding. Our goal is to cluster tracks into semantically interpretable activity models that are independent of scene location, whereas most previous work in video scene understanding has focused on learning location-specific normalcy models. Location-independent models can be used to detect instances of the same activity anywhere in the scene, or even across multiple scenes. Our insight for this unsupervised activity learning problem is to incorporate scene context to characterize the behavior of every track. By scene context, we mean local scene structures, such as building entrances, parking spots, and roads, that moving objects frequently interact with. Each track is attributed with a large number of potentially useful features that capture its relationships and interactions with a set of existing scene context elements. Once feature vectors are obtained, tracks are grouped in this feature space using state-of-the-art clustering techniques, without considering scene location. Experiments are conducted on webcam video of a complex scene with many interacting objects and very noisy tracks resulting from low frame rates and poor image quality. Our results demonstrate that location-independent and semantically interpretable groupings can be successfully obtained using unsupervised clustering methods, and that the resulting models are superior to standard location-dependent clustering.
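
The pipeline summarized in the abstract (describe each track relative to scene context elements, then cluster in that feature space without using absolute position) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three distance features per context type, the toy coordinates, the function names `nearest_dist`, `track_features`, and `kmeans`, and the use of plain k-means in place of whichever clustering techniques the paper actually evaluates.

```python
import math

def nearest_dist(point, elements):
    """Distance from a point to the closest (x, y) element center."""
    return min(math.dist(point, e) for e in elements)

def track_features(track, context):
    """Hypothetical context features, not the paper's feature set.

    For each context type (in sorted name order) emit the distance from the
    track's start, its end, and its closest point to the nearest element of
    that type. Absolute track coordinates never enter the feature vector.
    """
    feats = []
    for kind in sorted(context):
        elems = context[kind]
        feats.append(nearest_dist(track[0], elems))
        feats.append(nearest_dist(track[-1], elems))
        feats.append(min(nearest_dist(p, elems) for p in track))
    return feats

def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization.

    A stand-in chosen for brevity, not the clustering used in the paper.
    """
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centers))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centers[j])) for v in vectors
        ]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy scene (assumed): two doors and two parking spots, 100 units apart.
context = {
    "door": [(0, 0), (100, 0)],
    "parking": [(0, 50), (100, 50)],
}
tracks = [
    [(0, 0), (0, 25), (0, 50)],        # door -> parking, left side
    [(100, 0), (100, 25), (100, 50)],  # door -> parking, right side
    [(0, 50), (0, 25), (0, 0)],        # parking -> door, left side
    [(100, 50), (100, 25), (100, 0)],  # parking -> door, right side
]
features = [track_features(t, context) for t in tracks]
labels = kmeans(features, k=2)
print(labels)  # prints [0, 0, 1, 1]
```

In this toy scene the two "door to parking" tracks fall into one cluster and the two "parking to door" tracks into the other, even though each pair lies on opposite sides of the scene. Clustering the raw coordinates instead would tend to group tracks by side, which is the location-dependent behavior the abstract contrasts against.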
