Co-training non-robust classifiers for video semantic concept detection

Semantic video characterization by automatic metadata tagging is increasingly popular. While some concepts are unimodal, manifesting in either the image or the audio modality, a large number are multimodal, manifesting in both. Further, while concepts such as outdoors and face occur frequently enough in training sets, many others are rare, which makes them difficult to detect during automatic annotation. Semi-supervised learning algorithms such as co-training may help by incorporating a large amount of unlabeled data, holding the promise that redundant information across views can improve learning performance. Unfortunately, this promise has not been realized in multimedia content analysis, partly because models built from the labeled data alone are not sufficiently robust, and their noisy classification of the unlabeled set compounds the problems faced by the co-training algorithm. In this paper we analyze whether a judicious application of co-training, automatically labeling some of the unlabeled samples and re-inducting them into the training set along with manual quality control, can help improve detection performance. We report our findings on the TRECVID 2003 common annotation corpus.
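The co-training loop described above (two classifiers, one per view, each promoting its most confident predictions on unlabeled data into the shared labeled pool) can be sketched as follows. This is a minimal illustration only: the toy nearest-centroid learner and the inverse-distance confidence heuristic are assumptions standing in for the paper's per-modality concept detectors, and all function names are hypothetical.

```python
# Minimal co-training sketch. A weak nearest-centroid classifier is fit per
# view (e.g., image features and audio features); each round, each classifier
# labels its most confident unlabeled sample and adds it to the shared pool.
# The learner and confidence heuristic are illustrative assumptions, not the
# authors' actual detectors.

def centroid_fit(samples, labels):
    """Return per-class centroids for one view."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def centroid_predict(centroids, x):
    """Predict (label, confidence); confidence = inverse distance heuristic."""
    dists = {y: sum((a - b) ** 2 for a, b in zip(c, x)) ** 0.5
             for y, c in centroids.items()}
    y = min(dists, key=dists.get)
    return y, 1.0 / (1.0 + dists[y])

def co_train(view1, view2, labels, unl1, unl2, rounds=2, per_round=1):
    """view1/view2: labeled samples per view; unl1/unl2: parallel unlabeled views."""
    l1, l2, ly = list(view1), list(view2), list(labels)
    unlabeled = list(zip(unl1, unl2))
    for _ in range(rounds):
        if not unlabeled:
            break
        c1 = centroid_fit(l1, ly)
        c2 = centroid_fit(l2, ly)
        picks = set()
        # each view's classifier nominates its most confident unlabeled samples
        for clf, view_idx in ((c1, 0), (c2, 1)):
            scored = [(centroid_predict(clf, u[view_idx]), j)
                      for j, u in enumerate(unlabeled)]
            scored.sort(key=lambda t: -t[0][1])
            for (y, _conf), j in scored[:per_round]:
                if j not in picks:
                    picks.add(j)
                    l1.append(unlabeled[j][0])
                    l2.append(unlabeled[j][1])
                    ly.append(y)
        unlabeled = [u for j, u in enumerate(unlabeled) if j not in picks]
    return centroid_fit(l1, ly), centroid_fit(l2, ly)
```

The paper's point is precisely that when the initial classifiers (here, the centroids fit on the small labeled pool) are not robust, the promoted labels are noisy, which is why the authors interpose manual quality control before re-induction.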
