Constructing an Optimal Training Set for Video Annotation

This paper explores criteria for optimizing training set construction for video annotation. Most existing learning-based semantic annotation approaches require a large training set to achieve good generalization capacity, which in turn demands a considerable amount of labor-intensive manual labeling. However, it has been observed that the generalization capacity of a classifier depends more on the geometric distribution of the training data than on its size. We argue that a training set covering most of the temporal and spatial distribution of the whole data can achieve satisfying performance even when its size is limited. To capture the geometric distribution characteristics of a given video collection, we propose four metrics for constructing an optimal training set: Salience, Time Dispersiveness, Spatial Dispersiveness, and Diversity. Based on these metrics, we further propose a set of optimization rules that capture as much of the distribution information of the whole data as possible within a training set of a given size. Experimental results demonstrate that these rules are effective for training set construction for video annotation and significantly outperform random training set selection.
