Mining Visual Actions from Movies

This paper presents an approach for mining visual actions from real-world videos. Given a large number of movies, we want to automatically extract short video sequences corresponding to visual human actions. First, we retrieve candidate action samples by mining verbs extracted from the transcripts aligned with the videos. Not all of these samples visually characterize the action; we therefore rank the retrieved videos by visual consistency. We investigate two unsupervised outlier detection methods: a one-class Support Vector Machine (SVM) and densest component estimation on a similarity graph. Alternatively, we show how to use automatic weak supervision provided by a random background class, either by directly applying a binary SVM or by using an iterative re-training scheme for Support Vector Regression (SVR) machines. Experiments explore actions in 144 episodes of the TV series "Buffy the Vampire Slayer" and show: (a) the applicability of our approach to a large-scale set of real-world videos, (b) the importance of visual consistency for ranking videos retrieved from text, (c) the added value of random non-action samples, and (d) the ability of our iterative SVR re-training algorithm to handle weak supervision. The quality of the resulting rankings is assessed on manually annotated data for six different action classes.
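To make the densest-component idea concrete, the following is a minimal sketch of ranking text-retrieved clips by visual consistency with a greedy peeling heuristic on a similarity graph. It is an illustration under stated assumptions, not the paper's implementation: the function name `greedy_densest_component`, the density definition (total edge weight divided by number of nodes), the use of removal order as a ranking, and the toy similarity values are all hypothetical choices made here.

```python
def greedy_densest_component(sim):
    """Greedy peeling on a weighted similarity graph (illustrative sketch).

    sim: symmetric n x n list of lists of non-negative similarities
         (the diagonal is ignored).
    Returns (ranking, dense, best_density):
      ranking      - node indices, most consistent first (assumption:
                     nodes peeled later are treated as more consistent)
      dense        - node set with the highest density seen while peeling
      best_density - total edge weight / number of nodes for that set
    """
    n = len(sim)
    alive = set(range(n))
    # Weighted degree of each node with respect to the currently alive nodes.
    degree = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    total = sum(degree) / 2.0  # each edge is counted twice in the degrees
    best_density, best_size = total / n, n
    removal_order = []

    while len(alive) > 1:
        # Peel the node that is least connected to the remaining ones.
        v = min(alive, key=lambda i: (degree[i], i))
        alive.remove(v)
        removal_order.append(v)
        total -= degree[v]
        for u in alive:
            degree[u] -= sim[v][u]
        density = total / len(alive)
        if density > best_density:
            best_density, best_size = density, len(alive)

    removal_order.extend(alive)        # the last surviving node
    ranking = removal_order[::-1]      # last peeled = most consistent
    dense = set(ranking[:best_size])   # nodes alive at the best density
    return ranking, dense, best_density


# Toy example (made-up values): clips 0-2 look alike, clips 3-4 are outliers.
n_clips = 5
sim = [[0.1] * n_clips for _ in range(n_clips)]
for i in range(3):
    for j in range(3):
        sim[i][j] = 0.9
for i in range(n_clips):
    sim[i][i] = 0.0

ranking, dense, density = greedy_densest_component(sim)
print("ranking:", ranking)
print("dense component:", sorted(dense))
```

In this toy setting the three mutually similar clips form the dense component and the two weakly connected clips fall to the bottom of the ranking, which is the behaviour one would want when text retrieval returns clips that mention a verb without showing the action.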
