Learning automatic concept detectors from online video

Concept detection aims to automatically label video content with the semantic concepts appearing in it, such as objects, locations, or activities. While concept detectors have become key components in many research prototypes for content-based video retrieval, their practical use is limited by the need for large-scale annotated training sets. To overcome this problem, we propose to train concept detectors on material downloaded from web-based video sharing portals like YouTube. Training is then based on tags given by users during upload, no manual annotation is required, and concept detection can scale up to thousands of concepts. On the downside, web video is a complex domain as training material, and the tags associated with it are weak and unreliable. Consequently, a performance loss is to be expected when replacing high-quality state-of-the-art training sets with web video content. This paper presents a concept detection prototype named TubeTagger that uses YouTube content for autonomous training. In quantitative experiments, we compare the performance obtained when training on web video with that obtained when training on standard datasets from the literature. We demonstrate that concept detection in web video is feasible and that, when testing on YouTube videos, the YouTube-based detector outperforms the ones trained on standard training sets. By applying the YouTube-based prototype to datasets from the literature, we further demonstrate that: (1) If training annotations on the target domain are available, the resulting detectors significantly outperform the YouTube-based tagger. (2) If no annotations are available, the YouTube-based detector achieves performance comparable to that of detectors trained on standard datasets (a moderate relative performance loss of 11.4% is measured) while offering the advantage of fully automatic, scalable learning. (3) By enriching conventional training sets with online video material, performance improvements of 11.7% can be achieved when generalizing to domains unseen in training.
