A Generic Framework for Video Annotation via Semi-Supervised Learning

Learning-based video annotation is essential for video analysis and understanding, and many various approaches have been proposed to avoid the intensive labor costs of purely manual annotation. However, there lacks a generic framework due to several difficulties, such as dependence of domain knowledge, insufficiency of training data, no precise localization and in efficacy for large-scale video dataset. In this paper, we propose a novel approach based on semi-supervised learning by means of information from the Internet for interesting event annotation in videos. Concretely, a Fast Graph-based Semi-Supervised Multiple Instance Learning (FGSSMIL) algorithm, which aims to simultaneously tackle these difficulties in a generic framework for various video domains (e.g., sports, news, and movies), is proposed to jointly explore small-scale expert labeled videos and large-scale unlabeled videos to train the models. The expert labeled videos are obtained from the analysis and alignment of well-structured video related text (e.g., movie scripts, web-casting text, close caption). The unlabeled data are obtained by querying related events from the video search engine (e.g., YouTube, Google) in order to give more distributive information for event modeling. Two critical issues of FGSSMIL are: (1) how to calculate the weight assignment for a graph construction, where the weight of an edge specifies the similarity between two data points. To tackle this problem, we propose a novel Multiple Instance Learning Induced Similarity (MILIS) measure by learning instance sensitive classifiers; (2) how to solve the algorithm efficiently for large-scale dataset through an optimization approach. To address this issue, Concave-Convex Procedure (CCCP) and nonnegative multiplicative updating rule are adopted. We perform the extensive experiments in three popular video domains: movies, sports, and news. The results compared with the state-of-the-arts are promising and demonstrate the effectiveness and efficiency of our proposed approach.

[1]  Alan L. Yuille,et al.  The Concave-Convex Procedure , 2003, Neural Computation.

[2]  Andrew Zisserman,et al.  Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.

[3]  Masashi Sugiyama,et al.  Robust Label Propagation on Multiple Networks , 2009, IEEE Transactions on Neural Networks.

[4]  Tianzhu Zhang,et al.  Learning semantic scene models by object classification and trajectory clustering , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Deb Roy,et al.  Grounded Language Modeling for Automatic Speech Recognition of Sports Video , 2008, ACL.

[6]  Changsheng Xu,et al.  A generic framework for event detection in various video domains , 2010, ACM Multimedia.

[7]  Ben Taskar,et al.  Movie/Script: Alignment and Parsing of Video and Text Transcription , 2008, ECCV.

[8]  Rong Yan,et al.  Semi-supervised cross feature learning for semantic concept detection in videos , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[9]  Ramesh C. Jain,et al.  Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images , 2011, TIST.

[10]  Hyung-Myung Kim,et al.  Summarization of news video and its description for content‐based access , 2003, Int. J. Imaging Syst. Technol..

[11]  Mubarak Shah,et al.  Learning object motion patterns for anomaly detection and improved object detection , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Thomas Hofmann,et al.  Kernel Methods for Missing Variables , 2005, AISTATS.

[13]  Changshui Zhang,et al.  Instance-level Semisupervised Multiple Instance Learning , 2008, AAAI.

[14]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[15]  Hilary Buxton,et al.  Learning and understanding dynamic scene activity: a review , 2003, Image Vis. Comput..

[16]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[17]  Anoop Gupta,et al.  Automatically extracting highlights for TV Baseball programs , 2000, ACM Multimedia.

[18]  Hong Cheng,et al.  Sparsity induced similarity measure for label propagation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[19]  Chng Eng Siong,et al.  Automatic generation of personalized music sports video , 2005, MULTIMEDIA '05.

[20]  Meinard Müller,et al.  Information retrieval for music and motion , 2007 .

[21]  Robert D. Nowak,et al.  Unlabeled data: Now it helps, now it doesn't , 2008, NIPS.

[22]  A. Murat Tekalp,et al.  Automatic soccer video analysis and summarization , 2003, IEEE Trans. Image Process..

[23]  Noboru Babaguchi,et al.  Event based indexing of broadcasted sports video by intermodal collaboration , 2002, IEEE Trans. Multim..

[24]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Jean Ponce,et al.  Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[26]  Chih-Wei Huang Automatic Closed Caption Alignment Based on Speech Recognition Transcripts , 2003 .

[27]  Michael J. Witbrock,et al.  Story segmentation and detection of commercials in broadcast news video , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[28]  Noboru Babaguchi,et al.  Personalized abstraction of broadcasted American football video by highlight selection , 2004, IEEE Transactions on Multimedia.

[29]  Shih-Fu Chang,et al.  Event detection in baseball video using superimposed caption recognition , 2002, MULTIMEDIA '02.

[30]  Meng Wang,et al.  Correlative Linear Neighborhood Propagation for Video Annotation , 2009, IEEE Trans. Syst. Man Cybern. Part B.

[31]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.

[32]  Xian-Sheng Hua,et al.  Video Annotation Based on Kernel Linear Neighborhood Propagation , 2008, IEEE Transactions on Multimedia.

[33]  Noboru Babaguchi,et al.  Sports event detection using temporal patterns mining and web-casting text , 2008, AREA '08.

[34]  Nuno Vasconcelos,et al.  Anomaly detection in crowded scenes , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[36]  Shih-Fu Chang,et al.  Graph construction and b-matching for semi-supervised learning , 2009, ICML '09.

[37]  Tieniu Tan,et al.  A survey on visual surveillance of object motion and behaviors , 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[38]  Yihong Gong,et al.  Action detection in complex scenes with spatial and temporal ambiguities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[39]  Changsheng Xu,et al.  Live sports event detection based on broadcast video and web-casting text , 2006, MM '06.