A statistical framework for fusing mid-level perceptual features in news story segmentation

News story segmentation is essential for video indexing, summarization, and intelligence exploitation. In this paper, we present a general statistical framework, called the exponential model or maximum entropy model, that systematically selects the most significant mid-level features of various types (visual, audio, and semantic) and learns the optimal ways of fusing their combinations for story segmentation. The model uses a family of weighted exponential functions to account for the contributions of the different features. The Kullback-Leibler divergence measure is used in an optimization procedure that iteratively estimates the model parameters and automatically selects the optimal features. The framework scales to new features, adapts to new domains, and also reveals how each feature set contributes to the segmentation task. When tested on foreign news programs, the proposed techniques achieve a significant performance improvement over prior work using ad hoc algorithms and a slight gain over the state of the art using HMM-based models.
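To make the model family concrete, the following is a minimal sketch of a conditional exponential model q(b | x) proportional to exp(sum_i lambda_i f_i(x, b)), where b marks a candidate point as a story boundary. It is an illustration under stated assumptions, not the paper's implementation: the feature names (anchor face, long pause, noise), the toy data, and all function names are hypothetical, the features are assumed binary and active only for b = 1, and plain gradient ascent stands in for the iterative estimation procedure, which the abstract does not specify. The selection step relies on the fact that minimizing the average negative log-likelihood minimizes the Kullback-Leibler divergence between the empirical distribution and the model, up to a constant.

```python
import math

def prob_boundary(weights, feats):
    """q(b=1 | x) for features f_i(x, b) = g_i(x) * b (a two-class exponential model)."""
    score = sum(w * g for w, g in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-score))

def avg_neg_log_likelihood(weights, X, y):
    """Average negative log-likelihood; equals D(empirical || model) up to a constant."""
    total = 0.0
    for feats, b in zip(X, y):
        p = prob_boundary(weights, feats)
        total -= math.log(p if b == 1 else 1.0 - p)
    return total / len(X)

def fit(X, y, steps=500, lr=0.5):
    """Estimate the weights by gradient ascent (an assumed stand-in optimizer)."""
    n = len(X[0])
    weights = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for feats, b in zip(X, y):
            err = b - prob_boundary(weights, feats)  # observed minus expected
            for i, g in enumerate(feats):
                grad[i] += err * g
        weights = [w + lr * g / len(X) for w, g in zip(weights, grad)]
    return weights

def project(X, idx):
    """Keep only the feature columns listed in idx."""
    return [[row[i] for i in idx] for row in X]

def greedy_select(X, y, k):
    """Greedily add the candidate feature that lowers the divergence the most."""
    chosen, loss = [], math.log(2.0)  # log 2 is the loss of the uniform model
    for _ in range(k):
        best = None
        for j in range(len(X[0])):
            if j in chosen:
                continue
            sub = project(X, chosen + [j])
            cand_loss = avg_neg_log_likelihood(fit(sub, y), sub, y)
            if best is None or cand_loss < best[1]:
                best = (j, cand_loss)
        chosen.append(best[0])
        loss = best[1]
    return chosen, loss

# Hypothetical toy data: columns are [anchor_face, long_pause, noise].
X = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0],
     [0, 1, 1], [0, 0, 1], [1, 0, 0], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = story boundary

selected, final_loss = greedy_select(X, y, k=1)
print("selected feature indices:", selected)
print("loss after selection:", round(final_loss, 3))
```

On this toy set the first column is the most informative candidate, so the greedy step picks it first. With real mid-level features, the same loop would rank visual, audio, and semantic cues by how much each one reduces the divergence.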
