Interactive Video Annotation by Multi-Concept Multi-Modality Active Learning

Active learning has been shown to be an effective way to reduce human labeling effort in multimedia annotation tasks. However, most existing active learning methods for video annotation are studied in a relatively simple setting, where concepts are annotated sequentially with a fixed effort per concept and only a single modality is used. In practice, we usually have to deal with multiple modalities, and annotating concepts sequentially without any preference among them does not allocate annotation effort well. To address these two issues, in this paper we propose a multi-concept multi-modality active learning method for video annotation in which multiple concepts and multiple modalities are taken into consideration simultaneously. In each round of active learning, the method selects the concept that is expected to yield the highest performance gain, together with a batch of suitable samples to be annotated for this concept. Graph-based semi-supervised learning is then performed on each modality for the selected concept. By considering both the learnabilities of different concepts and the potentials of different modalities, the proposed method makes fuller use of the available human labeling effort. Experimental results on the TRECVID 2005 benchmark demonstrate its effectiveness and efficiency.
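The round structure described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: every function name is hypothetical, and three stand-ins are assumed because the abstract does not specify them. The expected gain of a concept is approximated by how much its scores changed after its last annotated batch, samples are chosen by closeness of their score to zero, and the per-modality graph-based learner is a standard closed-form label propagation whose outputs are simply averaged across modalities.

```python
import numpy as np

def propagate(features, labels, alpha=0.9, sigma=1.0):
    """Closed-form label propagation on one modality (assumed learner).
    labels: +1 / -1 for annotated samples, 0 for unlabeled ones."""
    sq_dist = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dist / (2 * sigma ** 2))      # Gaussian affinity graph
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))              # symmetric normalization
    n = len(labels)
    return np.linalg.solve(np.eye(n) - alpha * S, labels.astype(float))

def fused_scores(modalities, labels):
    """Average the per-modality scores (assumption: uniform fusion)."""
    return np.mean([propagate(X, labels) for X in modalities], axis=0)

def select_batch(scores, labels, batch_size):
    """Pick the unlabeled samples whose scores are closest to zero."""
    unlabeled = np.flatnonzero(labels == 0)
    order = np.argsort(np.abs(scores[unlabeled]))
    return unlabeled[order[:batch_size]]

def active_learning(modalities, labels, oracle, rounds, batch_size):
    """labels: (n_concepts, n_samples) array with values in {+1, -1, 0}.
    oracle(concept, indices) returns the true labels (the human annotator)."""
    labels = labels.copy()
    n_concepts = labels.shape[0]
    scores = np.array([fused_scores(modalities, labels[c])
                       for c in range(n_concepts)])
    # Infinite initial gain so that every concept is tried once.
    expected_gain = np.full(n_concepts, np.inf)
    history = []
    for _ in range(rounds):
        c = int(np.argmax(expected_gain))        # concept with highest gain
        batch = select_batch(scores[c], labels[c], batch_size)
        if len(batch) == 0:                      # nothing left to annotate
            expected_gain[c] = -np.inf
            continue
        labels[c, batch] = oracle(c, batch)
        new_scores = fused_scores(modalities, labels[c])
        # Hypothetical gain proxy: mean change in scores after this batch.
        expected_gain[c] = np.abs(new_scores - scores[c]).mean()
        scores[c] = new_scores
        history.append(c)
    return labels, scores, history
```

Here `modalities` is a list of feature matrices, one per modality, all describing the same samples in the same order. A real system would replace the gain proxy with an actual estimate of performance improvement and would likely weight modalities by their usefulness for each concept rather than averaging them uniformly.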