Video Annotation Through Search and Graph Reinforcement Mining

Unlimited-vocabulary annotation of multimedia documents remains elusive despite progress on the problem for a small, fixed lexicon. Exploiting the repetitive nature of online media databases, in which many documents carry independently supplied annotations, we present an approach that automatically annotates multimedia documents by using mining techniques to discover new annotations from similar documents and to filter out existing incorrect annotations. The annotation set is not limited to words that have training data or for which models have been built; it is limited only by the collective annotation vocabulary of all documents in the database. A graph reinforcement method driven by one modality (e.g., visual) determines how much each similar document contributes to the annotation of the target. The graph then supplies candidate annotations from a different modality (e.g., text) that can be mined to annotate the target. Experiments are performed on videos crawled from YouTube. A customized precision-recall metric shows that the annotations obtained with the proposed method are superior to the documents' original annotations. These extended, filtered tags also outperform a state-of-the-art semi-supervised graph reinforcement technique applied to the initial user-supplied annotations.
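As a concrete illustration of the graph reinforcement step, the following Python sketch propagates tag scores over a visual-similarity graph so that tags from visually similar videos reinforce the target's annotations, while weakly reinforced tags can later be filtered out. This is a minimal sketch using a standard label-propagation style update, F = αSF + (1−α)Y; the function name, normalization, and parameters (alpha, iters) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def reinforce_tags(visual_sim, initial_tags, vocab, alpha=0.85, iters=50):
    """Propagate tag scores over a visual-similarity graph.

    visual_sim   : (n, n) array of pairwise visual similarities between videos
    initial_tags : list of sets of user-supplied tags, one set per video
    vocab        : ordered list of all tags in the collective vocabulary
    """
    n = len(initial_tags)

    # Row-normalize the similarity graph so propagated scores stay bounded.
    S = visual_sim / (visual_sim.sum(axis=1, keepdims=True) + 1e-12)

    # Binary video-by-tag matrix built from the original user annotations.
    tag_index = {t: j for j, t in enumerate(vocab)}
    Y = np.zeros((n, len(vocab)))
    for i, tags in enumerate(initial_tags):
        for t in tags:
            Y[i, tag_index[t]] = 1.0

    # Iterative reinforcement: each video absorbs tag evidence from its
    # visually similar neighbours while retaining its own annotations.
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y

    return F  # F[i, j] is the reinforced score of tag vocab[j] for video i
```

In this sketch, the top-scoring tags per video would form the extended annotation set, and original tags whose reinforced score remains low could be discarded as likely noise, which is the filtering behavior the abstract describes.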
