Graphical models for context-aware analysis of continuous videos

In this paper, we show how graphical models can be used to localize and recognize activities in continuous videos. The model consists of an action layer and a hidden activity layer. The action layer is modeled as a linear-chain conditional random field (CRF) whose variables are the activity labels of action segments. Hidden activity variables are then introduced to smooth the activity labels of the action segments and thus generate semantically meaningful activities. Using a task-oriented discriminative approach, we formulate the learning problem as a latent Structural Support Vector Machine (SSVM). We show promising results on the UCLA Office Dataset that demonstrate the effectiveness of the proposed framework.
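As an illustration of the action layer only, the following minimal sketch shows generic max-sum (Viterbi) decoding on a linear chain of segment labels. It is not the paper's inference procedure: the hidden activity layer and the latent SSVM learning step are omitted, and the function name `viterbi_decode`, the score tables `unary` and `pairwise`, and the numeric values are all assumptions made for the example.

```python
from typing import List, Sequence


def viterbi_decode(unary: Sequence[Sequence[float]],
                   pairwise: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear chain.

    unary[t][k]    : hypothetical score of label k for action segment t
    pairwise[j][k] : hypothetical score of label j being followed by label k
    """
    n_segments = len(unary)
    if n_segments == 0:
        return []
    n_labels = len(unary[0])

    # best[k] = score of the best partial labeling ending in label k
    best = list(unary[0])
    backptr = []  # backptr[t-1][k] = best previous label for label k at step t

    for t in range(1, n_segments):
        new_best = []
        ptr = []
        for k in range(n_labels):
            prev = max(range(n_labels), key=lambda j: best[j] + pairwise[j][k])
            new_best.append(best[prev] + pairwise[prev][k] + unary[t][k])
            ptr.append(prev)
        best = new_best
        backptr.append(ptr)

    # Trace back from the best final label.
    label = max(range(n_labels), key=lambda k: best[k])
    path = [label]
    for ptr in reversed(backptr):
        label = ptr[label]
        path.append(label)
    path.reverse()
    return path


# Three segments, two labels. The middle segment weakly prefers label 1,
# but the pairwise term rewards neighbouring segments that agree.
unary = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
pairwise = [[1.0, -1.0], [-1.0, 1.0]]
print(viterbi_decode(unary, pairwise))
```

In this toy input, labeling each segment independently would give the middle segment label 1, whereas chain decoding assigns label 0 to all three segments because the pairwise scores favour consistent neighbours. This is the kind of contextual smoothing a chain model provides, shown here under the stated assumptions rather than with the paper's actual potentials.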
