Discovery and fusion of salient multimodal features toward news story segmentation

In this paper, we present new results on news video story segmentation and classification in the context of the TRECVID 2003 video retrieval benchmark. We apply and extend the Maximum Entropy statistical model to fuse diverse features from multiple levels and modalities, including visual, audio, and text. The features include motion, face, music/speech types, prosody, and high-level text segmentation information. The statistical fusion model automatically discovers the features that are relevant to detecting story boundaries. One novel aspect of our method is a feature wrapper that handles different types of features: asynchronous, discrete, continuous, and delta. We also developed several novel prosody-related features. Using the large news video set from the TRECVID 2003 benchmark, we demonstrate satisfactory performance (F1 measures up to 0.76 on ABC news and 0.73 on CNN news), show how these multi-level, multi-modal features construct the probabilistic framework, and, more importantly, observe an interesting opportunity for further improvement.
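
The fusion idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the feature wrapper or the model is built, so the wrapper functions, thresholds, window size, feature names, and toy data below are all hypothetical. The sketch assumes that each raw feature is converted into a binary indicator at every candidate boundary point, and that the binary indicators are fused by a two-class conditional maximum entropy model (equivalent to logistic regression) fitted by plain gradient ascent.

```python
import math

# Hypothetical feature wrappers: each maps a raw feature stream to one
# binary indicator per candidate boundary point.

def wrap_continuous(values, threshold):
    """Binarize a continuous feature: 1 where the value exceeds the threshold."""
    return [1 if v > threshold else 0 for v in values]

def wrap_delta(values, threshold):
    """Binarize the change of a feature between consecutive candidate points."""
    deltas = [0.0] + [values[i] - values[i - 1] for i in range(1, len(values))]
    return [1 if d > threshold else 0 for d in deltas]

def wrap_asynchronous(event_times, candidate_times, window):
    """Mark candidates with an event (e.g. a detected pause) within +/- window seconds."""
    return [1 if any(abs(t - e) <= window for e in event_times) else 0
            for t in candidate_times]

def predict(weights, bias, x):
    """Posterior probability of a story boundary under the two-class model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(rows, labels, epochs=500, lr=0.5):
    """Fit the weights by gradient ascent on the conditional log-likelihood."""
    n_feat = len(rows[0])
    weights, bias = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(rows, labels):
            err = y - predict(weights, bias, x)  # observed minus expected
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w + lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias += lr * grad_b / len(rows)
    return weights, bias

# Toy data (invented for illustration only).
candidate_times = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # seconds
pause_events = [19.5, 50.3]                              # asynchronous audio events
motion = [0.9, 0.2, 0.8, 0.7, 0.1, 0.9]                  # continuous visual feature
labels = [0, 1, 0, 0, 1, 0]                              # 1 = story boundary

rows = list(zip(
    wrap_asynchronous(pause_events, candidate_times, window=1.0),
    wrap_continuous(motion, threshold=0.5),
    wrap_delta(motion, threshold=0.5),
))
weights, bias = train_maxent(rows, labels)
posteriors = [predict(weights, bias, x) for x in rows]
```

In this toy setup the learned weights indicate which wrapped features contribute to boundary detection: an indicator that co-occurs with boundaries receives a positive weight, and one that co-occurs with non-boundaries receives a negative weight.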
