From Videos to Verbs: Mining Videos for Activities using a Cascade of Dynamical Systems

Clustering a single video stream in order to infer and extract the activities it contains is an important problem with applications in video indexing, surveillance, activity discovery, and event recognition. Clustering a video sequence into activities requires simultaneously locating activity boundaries, which delimit activity-consistent subsequences, and clustering those subsequences. To do this, we build a generative model for activities in video using a cascade of dynamical systems, and we show that this model can capture and represent a diverse class of activities. We then derive algorithms to learn the model parameters from a video stream and show how a single video sequence can be partitioned into clusters, each of which represents an activity. We also propose a novel technique to build affine, view, and rate invariance of the activity into the distance metric used for clustering. Experiments show that the clusters found by the algorithm correspond to semantically meaningful activities.
