Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models

We propose a novel unsupervised learning framework to model activities and interactions in crowded and complicated scenes. Hierarchical Bayesian models are used to connect three elements in visual surveillance: low-level visual features, simple "atomic" activities, and interactions. Atomic activities are modeled as distributions over low-level visual features, and multi-agent interactions are modeled as distributions over atomic activities. These models are learnt in an unsupervised way. Given a long video sequence, moving pixels are clustered into different atomic activities and short video clips are clustered into different interactions. In this paper, we propose three hierarchical Bayesian models, Latent Dirichlet Allocation (LDA) mixture model, Hierarchical Dirichlet Process (HDP) mixture model, and Dual Hierarchical Dirichlet Processes (Dual-HDP) model. They advance existing language models, such as LDA [1] and HDP [2]. Our data sets are challenging video sequences from crowded traffic scenes and train station scenes with many kinds of activities co-occurring. Without tracking and human labeling effort, our framework completes many challenging visual surveillance tasks of board interest such as: (1) discovering typical atomic activities and interactions; (2) segmenting long video sequences into different interactions; (3) segmenting motions into different activities; (4) detecting abnormality; and (5) supporting high-level queries on activities and interactions.

[1]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[2]  Yang Wang,et al.  Unsupervised Discovery of Action Classes , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[3]  Yiming Yang,et al.  A Probabilistic Model for Online Document Clustering with Application to Novelty Detection , 2004, NIPS.

[4]  Yee Whye Teh,et al.  Collapsed Variational Inference for HDP , 2007, NIPS.

[5]  W. Eric L. Grimson,et al.  Spatial Latent Dirichlet Allocation , 2007, NIPS.

[6]  James W. Davis,et al.  The representation and recognition of human movement using temporal templates , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[8]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[9]  Antonio Torralba,et al.  Learning hierarchical models of scenes, objects, and parts , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10]  Mubarak Shah,et al.  View-Invariant Representation and Recognition of Actions , 2002, International Journal of Computer Vision.

[11]  Larry S. Davis,et al.  W4: Real-Time Surveillance of People and Their Activities , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[13]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[14]  Alexei A. Efros,et al.  Discovering object categories in image collections , 2005 .

[15]  Hinrich Schütze,et al.  Projections for efficient document clustering , 1997, SIGIR '97.

[16]  Alan E. Gelfand,et al.  SPATIAL NONPARAMETRIC BAYESIAN MODELS , 2001 .

[17]  Eli Shechtman,et al.  Space-time behavior based correlation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[18]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[19]  Ramakant Nevatia,et al.  Multi-agent event recognition , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[20]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[21]  James W. Davis,et al.  The Representation and Recognition of Action Using Temporal Templates , 1997, CVPR 1997.

[22]  Larry S. Davis,et al.  Representation and Recognition of Events in Surveillance Video Using Petri Nets , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[23]  David C. Hogg,et al.  Learning the Distribution of Object Trajectories for Event Recognition , 1995, BMVC.

[24]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[25]  Anthony G. Cohn,et al.  Constructing qualitative event models automatically from video input , 2000, Image Vis. Comput..

[26]  Jerry Nedelman,et al.  Book review: “Bayesian Data Analysis,” Second Edition by A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin Chapman & Hall/CRC, 2004 , 2005, Comput. Stat..

[27]  Mubarak Shah,et al.  TemporalBoost for event recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[28]  Jianbo Shi,et al.  Detecting unusual activity in video , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[29]  W. Eric L. Grimson,et al.  Learning Patterns of Activity Using Real-Time Tracking , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[30]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[31]  Antonio Torralba,et al.  Describing Visual Scenes using Transformed Dirichlet Processes , 2005, NIPS.

[32]  Matthew Brand,et al.  Discovery and Segmentation of Activities in Video , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[33]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[34]  Aaron F. Bobick,et al.  A Framework for Recognizing Multi-Agent Action from Visual Evidence , 1999, AAAI/IAAI.

[35]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[36]  Ramakant Nevatia,et al.  Event Detection and Analysis from Video Streams , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[37]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[38]  Lihi Zelnik-Manor,et al.  Event-based analysis of video , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[39]  W. Eric L. Grimson,et al.  Learning Semantic Scene Models by Trajectory Analysis , 2006, ECCV.

[40]  Alex Pentland,et al.  A Bayesian Computer Vision System for Modeling Human Interactions , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[41]  John K Kruschke,et al.  Bayesian data analysis. , 2010, Wiley interdisciplinary reviews. Cognitive science.

[42]  Tieniu Tan,et al.  Trajectory Series Analysis based Event Rule Induction for Visual Surveillance , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[44]  Svetha Venkatesh,et al.  AdaBoost.MRF: Boosted Markov Random Forests and Application to Multilevel Activity Recognition , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[45]  T. Ferguson A Bayesian Analysis of Some Nonparametric Problems , 1973 .

[46]  Shaogang Gong,et al.  Video behaviour profiling and abnormality detection without manual labelling , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[47]  Shaogang Gong,et al.  Beyond Tracking: Modelling Activity and Understanding Behaviour , 2006, International Journal of Computer Vision.