Multi-modality web video categorization

This paper reports a first comprehensive study and large-scale test on web video (so-called user generated video or micro video) categorization. Observing that web videos are characterized by a much higher diversity of quality, subject, style, and genres compared with traditional video programs, we focus on studying the effectiveness of different modalities in dealing with such high variation. Specifically, we propose two novel modalities including a semantic modality and a surrounding text modality, as effective complements to most commonly used low-level features. The semantic modality includes three feature representations, i.e., concept histogram, visual word vector model and visual word Latent Semantic Analysis (LSA), while text modality includes the titles, descriptions and tags of web videos. We conduct a set of comprehensive experiments for evaluating the effectiveness of the proposed feature representations over different classifiers such as Support Vector Machine (SVM), Gaussian Mixture Model (GMM) and Manifold Ranking (MR). Our experiments on a large-scale dataset with 11k web videos (nearly 450 hours in total) demonstrate that (1) the proposed multimodal feature representation is effective for web video categorization, and (2) SVM outperforms GMM and MR on nearly all the modalities.

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[3]  Wolfgang Effelsberg,et al.  Automatic recognition of film genres , 1995, MULTIMEDIA '95.

[4]  Rainer Lienhart,et al.  Abstracting home video automatically , 1999, MULTIMEDIA '99.

[5]  Ba Tu Truong,et al.  Automatic genre identification for content-based video categorization , 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[6]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[7]  Fabrice Souvannavong,et al.  Latent semantic analysis for an effective region-based video shot retrieval system , 2004, MIR '04.

[8]  Yaser Sheikh,et al.  On the use of computable features for film classification , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[9]  Kazuaki Kishida Property of average precision and its generalization: An examination of evaluation indicator for information retrieval experiments , 2005 .

[10]  Vasile Palade,et al.  Multi-Classifier Systems: Review and a roadmap for developers , 2006, Int. J. Hybrid Intell. Syst..

[11]  Marcel Worring,et al.  The challenge problem for automated detection of 101 semantic concepts in multimedia , 2006, MM '06.

[12]  Meng Wang,et al.  Microsoft Research Asia TRECVID 2006 High-Level Feature Extraction and Rushes Exploitation , 2006, TRECVID.

[13]  Lie Lu,et al.  Automatic mood detection and tracking of music audio signals , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Dong Xu,et al.  Columbia University TRECVID-2006 Video Search and High-Level Feature Extraction , 2006, TRECVID.

[15]  Tao Mei,et al.  Automatic Video Genre Categorization using Hierarchical SVM , 2006, 2006 International Conference on Image Processing.

[16]  Rong Yan,et al.  Multi-Lingual Broadcast News Retrieval , 2006, TRECVID.

[17]  John R. Smith,et al.  IBM Research TRECVID-2009 Video Retrieval System , 2009, TRECVID.