Optimizing multi-graph learning: towards a unified video annotation scheme

Learning based semantic video annotation is a promising approach for enabling content-based video search. However, severe difficulties, such as insufficiency of training data and curse of dimensionality, are frequently encountered. This paper proposes a novel unified scheme, Optimized Multi-Graph-based Semi-Supervised Learning (OMG-SSL), to simultaneously attack these difficulties. Instead of only using a single graph, OMG-SSL integrates multiple graphs into a regularization and optimization framework to sufficiently explore their complementary nature. We then show that various crucial factors in video annotation, including multiple modalities, multiple distance metrics, and temporal consistency, in fact all correspond to different correlations among samples, and hence they can be represented by different graphs. Therefore, OMG-SSL is able to simultaneously deal with these factors within a unified framework. Experiments on the TRECVID benchmark demonstrate the effectiveness of our proposed approach.

[1]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[2]  Meng Wang,et al.  Multi-Graph Semi-Supervised Learning for Video Semantic Feature Extraction , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[3]  Jun Yang,et al.  Exploring temporal consistency for video analysis and retrieval , 2006, MIR '06.

[4]  Rong Yan,et al.  Semi-supervised cross feature learning for semantic concept detection in videos , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[5]  Nicu Sebe,et al.  Toward Robust Distance Metric Analysis for Similarity Estimation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[6]  Sanjeev Khudanpur,et al.  Hidden Markov models for automatic annotation and content-based retrieval of images and video , 2005, SIGIR '05.

[7]  Stefan M. Rüger,et al.  Mining multimedia salient concepts for incremental information extraction , 2005, SIGIR '05.

[8]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[9]  R. Manmatha,et al.  Multiple Bernoulli relevance models for image and video annotation , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[10]  Alexander G. Hauptmann,et al.  LSCOM Lexicon Definitions and Annotations (Version 1.0) , 2006 .

[11]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[12]  John R. Smith,et al.  User-trainable video annotation using multimodal cues , 2003, SIGIR '03.

[13]  Jingrui He,et al.  Manifold-ranking based image retrieval , 2004, MULTIMEDIA '04.

[14]  John R. Kender,et al.  Video News Shot Labeling Refinement via Shot Rhythm Models , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[15]  Alexander G. Hauptmann Lessons for the Future from a Decade of Informedia Video Analysis Research , 2005, CIVR.

[16]  Meng Wang,et al.  Manifold-ranking based video concept detection on large database and feature pool , 2006, MM '06.

[17]  Ronald Rosenfeld,et al.  Semi-supervised learning with graphs , 2005 .

[18]  R. Manmatha,et al.  Automatic image annotation and retrieval using cross-media relevance models , 2003, SIGIR.

[19]  John R. Smith,et al.  On the detection of semantic concepts at TRECVID , 2004, MULTIMEDIA '04.

[20]  Wei-Ying Ma,et al.  Graph based multi-modality learning , 2005, ACM Multimedia.

[21]  Edward Y. Chang,et al.  Optimal multimodal fusion for multimedia data analysis , 2004, MULTIMEDIA '04.

[22]  Markus A. Stricker,et al.  Similarity of color images , 1995, Electronic Imaging.

[23]  Rong Yan,et al.  The combination limit in multimedia retrieval , 2003, MULTIMEDIA '03.

[24]  Meng Wang,et al.  Semi-automatic video annotation based on active learning with multiple complementary predictors , 2005, MIR '05.

[25]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[26]  Xiangming Mu,et al.  Content-based video retrieval: does video's semantic visual feature matter? , 2006, SIGIR.

[27]  Maria-Florina Balcan,et al.  Person Identification in Webcam Images: An Application of Semi-Supervised Learning , 2005 .

[28]  Meng Wang,et al.  Automatic video annotation by semi-supervised learning with kernel density estimation , 2006, MM '06.

[29]  Nicu Sebe,et al.  Toward Improved Ranking Metrics , 2000, IEEE Trans. Pattern Anal. Mach. Intell..