Joint action recognition and summarization by sub-modular inference

Action recognition and video summarization are two important multimedia tasks that are useful for applications such as video indexing and retrieval, video surveillance, human-computer interaction and home intelligence. While many approaches exist in the literature for these two tasks, to date they have always been addressed separately. In this paper, we instead start from the assumption that the two tasks should be tackled as a joint objective: on the one hand, action recognition can drive the selection of meaningful and informative summaries; on the other, recognizing actions from a summary rather than from the entire video can in principle reduce noise and prove more accurate. To this end, we propose a novel approach for joint action recognition and summarization based on the latent structural SVM framework, together with an efficient algorithm for inferring the action and the summary that exploits the property of sub-modularity. Experimental results on a challenging benchmark, MSRDailyActivity3D, show that the approach is capable of achieving remarkable action recognition accuracy while providing appealing video summaries.
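As an illustration of why sub-modularity makes summary inference efficient, the following minimal sketch greedily selects frames under a cardinality budget. It is not the objective or algorithm used in the paper: the coverage-style score, the names `greedy_summary` and `make_coverage_score`, and the toy "visual word" sets are all assumptions introduced here, and the joint search over the action label is omitted. For monotone sub-modular scores, this kind of greedy selection is known to come within a constant factor of the optimum.

```python
from typing import Callable, FrozenSet, List, Set


def make_coverage_score(features: List[Set[int]]) -> Callable[[FrozenSet[int]], float]:
    """Build a toy monotone sub-modular score (hypothetical, for illustration).

    features[i] is the set of "visual words" observed in frame i; the score of
    a summary is the number of distinct words it covers.
    """
    def score(summary: FrozenSet[int]) -> float:
        covered: Set[int] = set()
        for frame in summary:
            covered |= features[frame]
        return float(len(covered))
    return score


def greedy_summary(n_frames: int, budget: int,
                   score: Callable[[FrozenSet[int]], float]) -> List[int]:
    """Greedily pick up to `budget` frames, each time adding the frame
    with the largest positive marginal gain in `score`."""
    selected: List[int] = []
    current = score(frozenset())
    for _ in range(budget):
        best_gain, best_frame = 0.0, None
        for frame in range(n_frames):
            if frame in selected:
                continue
            gain = score(frozenset(selected + [frame])) - current
            if gain > best_gain:
                best_gain, best_frame = gain, frame
        if best_frame is None:  # no remaining frame improves the score
            break
        selected.append(best_frame)
        current += best_gain
    return sorted(selected)


# Toy usage: four frames described by small sets of visual words.
toy_features = [{1, 2}, {2, 3}, {4}, {1, 2, 3}]
toy_score = make_coverage_score(toy_features)
summary = greedy_summary(len(toy_features), budget=2, score=toy_score)
```

In a joint setting, one plausible design is to run such a selection once per candidate action label and keep the label whose best summary scores highest; whether the paper does exactly this is not stated in the abstract.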
