A generic framework for event detection in various video domains

Event detection is essential for the extensively studied video analysis and understanding area. Although various approaches have been proposed for event detection, there is a lack of a generic event detection framework that can be applied to various video domains (e.g. sports, news, movies, surveillance). In this paper, we present a generic event detection approach based on semi-supervised learning and Internet vision. Concretely, a Graph-based Semi-Supervised Multiple Instance Learning (GSSMIL) algorithm is proposed to jointly explore small-scale expert labeled videos and large-scale unlabeled videos to train the event models to detect video event boundaries. The expert labeled videos are obtained from the analysis and alignment of well-structured video related text (e.g. movie scripts, web-casting text, close caption). The unlabeled data are obtained by querying related events from the video search engine (e.g. YouTube) in order to give more distributive information for event modeling. A critical issue of GSSMIL in constructing a graph is the weight assignment, where the weight of an edge specifies the similarity between two data points. To tackle this problem, we propose a novel Multiple Instance Learning Induced Similarity (MILIS) measure by learning instance sensitive classifiers. We perform the thorough experiments in three popular video domains: movies, sports and news. The results compared with the state-of-the-arts are promising and demonstrate our proposed approach is performance-effective.

[1]  Ben Taskar,et al.  Movie/Script: Alignment and Parsing of Video and Text Transcription , 2008, ECCV.

[2]  Chng Eng Siong,et al.  Automatic generation of personalized music sports video , 2005, MULTIMEDIA '05.

[3]  Andrew Zisserman,et al.  Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.

[4]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[5]  Noboru Babaguchi,et al.  Sports event detection using temporal patterns mining and web-casting text , 2008, AREA '08.

[6]  Meinard Müller,et al.  Information retrieval for music and motion , 2007 .

[7]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[8]  Robert D. Nowak,et al.  Unlabeled data: Now it helps, now it doesn't , 2008, NIPS.

[9]  Shih-Fu Chang,et al.  Event detection in baseball video using superimposed caption recognition , 2002, MULTIMEDIA '02.

[10]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.

[11]  Nuno Vasconcelos,et al.  Anomaly detection in crowded scenes , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Alan L. Yuille,et al.  The Concave-Convex Procedure , 2003, Neural Computation.

[13]  Mirella Lapata,et al.  Proceedings of ACL-08: HLT , 2008 .

[14]  Noboru Babaguchi,et al.  Personalized abstraction of broadcasted American football video by highlight selection , 2004, IEEE Transactions on Multimedia.

[15]  Tianzhu Zhang,et al.  Learning semantic scene models by object classification and trajectory clustering , 2009, CVPR.

[16]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[17]  Anoop Gupta,et al.  Automatically extracting highlights for TV Baseball programs , 2000, ACM Multimedia.

[18]  Yihong Gong,et al.  Action detection in complex scenes with spatial and temporal ambiguities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[19]  A. Murat Tekalp,et al.  Automatic Soccer Video Analysis and Summarization , 2003, IS&T/SPIE Electronic Imaging.

[20]  Hyung-Myung Kim,et al.  Summarization of news video and its description for content‐based access , 2003, Int. J. Imaging Syst. Technol..

[21]  Thomas Hofmann,et al.  Kernel Methods for Missing Variables , 2005, AISTATS.

[22]  Jean Ponce,et al.  Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[23]  Noboru Babaguchi,et al.  Event based indexing of broadcasted sports video by intermodal collaboration , 2002, IEEE Trans. Multim..

[24]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Changsheng Xu,et al.  Live sports event detection based on broadcast video and web-casting text , 2006, MM '06.

[26]  Thomas S. Huang,et al.  Image processing , 1971 .

[27]  Hong Cheng,et al.  Sparsity induced similarity measure for label propagation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[28]  Changshui Zhang,et al.  Instance-level Semisupervised Multiple Instance Learning , 2008, AAAI.

[29]  Deb Roy,et al.  Grounded Language Modeling for Automatic Speech Recognition of Sports Video , 2008, ACL.

[30]  Michael J. Witbrock,et al.  Story segmentation and detection of commercials in broadcast news video , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[31]  Masashi Sugiyama,et al.  Robust Label Propagation on Multiple Networks , 2009, IEEE Transactions on Neural Networks.

[32]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[33]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[34]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[35]  Chih-Wei Huang Automatic Closed Caption Alignment Based on Speech Recognition Transcripts , 2003 .

[36]  Mubarak Shah,et al.  Learning object motion patterns for anomaly detection and improved object detection , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.