Transductive Inference with Hierarchical Clustering for Video Annotation

In this paper, we present a novel framework for video semantic detection based on transductive inference and hierarchical clustering, which focuses directly on predicting the labels of the samples in a given unlabeled pool, rather than building a classifier intended to work on arbitrary unseen data. In this framework, a number of hierarchical clustering results are constructed from the entire video dataset, which contains both labeled and unlabeled examples. We aim to make the clusters as pure as possible, i.e., samples within the same cluster should mostly share the same label. To further purify these hierarchical clustering results, an EM-based cluster-tuning algorithm is applied iteratively. Based on these clustering results, several hypotheses are generated by probability voting among the labeled samples in the obtained clusters. One of these hypotheses is then selected according to the Vapnik combined bound and applied to predict the labels of the unlabeled samples. Unlike inductive learning, which must produce a general classifier, the selected transductive hypothesis only needs to predict the available unlabeled samples in the test set, and it exploits the structure and distribution of the unlabeled pool to achieve a smaller test error bound. Both our theoretical analysis and our experimental results show that this leads to better generalization for video annotation.
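
To make the pipeline concrete, the following is a minimal Python sketch of the clustering-and-voting stages under simplifying assumptions: scikit-learn's AgglomerativeClustering stands in for the paper's hierarchical clustering, the EM-based cluster tuning is omitted, and the Vapnik-combined-bound selection is replaced by a simplified labeled-set-error criterion. All function names and parameters are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def vote_hypothesis(X, y, n_clusters):
    """Build one hypothesis: cluster all samples (labeled + unlabeled),
    then label every member of a cluster with the majority label of its
    labeled members (probability voting). y uses -1 for unlabeled samples."""
    clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    h = np.full_like(y, -1)
    for c in np.unique(clusters):
        members = clusters == c
        labeled = y[members & (y != -1)]
        if labeled.size > 0:  # clusters with no labeled members stay unlabeled
            values, counts = np.unique(labeled, return_counts=True)
            h[members] = values[np.argmax(counts)]
    return h

def select_hypothesis(hypotheses, y):
    """Simplified stand-in for the Vapnik-combined-bound selection: pick
    the hypothesis with the lowest disagreement on the labeled samples."""
    labeled = y != -1
    errors = [np.mean(h[labeled] != y[labeled]) for h in hypotheses]
    return hypotheses[int(np.argmin(errors))]

# Toy run: 200 samples, 40 of them labeled. Hypotheses are generated from
# hierarchical clusterings at several granularities, one is selected, and
# its entries at the unlabeled positions are the transductive predictions.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
y = np.full(200, -1)
y[:40] = rng.integers(0, 2, 40)
hypotheses = [vote_hypothesis(X, y, k) for k in (4, 8, 16)]
prediction = select_hypothesis(hypotheses, y)
```

Note that the labeled-error criterion above is only a placeholder: the paper's selection step uses the Vapnik combined bound, which also accounts for the structure of the unlabeled pool rather than the labeled samples alone.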