Generative, discriminative, and ensemble learning on multi-modal perceptual fusion toward news video story segmentation

News video story segmentation is a critical task for automatic video indexing and summarization. Our prior work has demonstrated promising performance by using a generative model, called maximum entropy (ME), which models the posterior probability given the multi-modal perceptual features near the candidate points. In this paper, we investigate alternative statistical approaches based on discriminative models, i.e. support vector machine (SVM), and ensemble learning, i.e. boosting. In addition, we develop a novel approach, called BoostME, which uses the ME classifiers and the associated confidence scores in each boosting iteration. We evaluated these different methods using the TRECVID 2003 broadcast news data set. We found that SVM-based and ME-based techniques both outperformed the pure boosting techniques, with the SVM-based solutions achieving even slightly higher accuracy. Moreover, we summarize extensive analysis results of error sources over distinctive news story types to identify future research opportunities

[1]  John D. Lafferty,et al.  Statistical Models for Text Segmentation , 1999, Machine Learning.

[2]  Shih-Fu Chang,et al.  Discovery and fusion of salient multimodal features toward news story segmentation , 2003, IS&T/SPIE Electronic Imaging.

[3]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[4]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[5]  Paul A. Viola,et al.  Boosting Image Retrieval , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[6]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.