COLUMBIA-IBM NEWS VIDEO STORY SEGMENTATION IN TRECVID 2004

In this technical report, we give an overview of our technical developments for the story segmentation task in TRECVID 2004. Among them, we propose an information-theoretic framework, visual cue cluster construction (VC), to automatically discover adequate mid-level features. The problem is posed as mutual information maximization, through which optimal cue clusters are discovered that preserve the most information about the semantic labels. We extend the Information Bottleneck framework to high-dimensional continuous features and further propose a projection method that maps each video into probabilistic memberships over all the cue clusters. The biggest advantage of the proposed approach is that it removes the dependence on manual selection of mid-level features and the huge labor cost of annotating a training corpus for each mid-level feature detector. When tested on TRECVID 2004 news video story segmentation, the proposed approach achieves a promising performance gain over representations derived from conventional clustering techniques and even over manually selected mid-level features; it also achieves one of the top performances, F1=0.65, close to the highest performance, F1=0.69, reported by other groups. We also experiment with other promising visual features and continue investigating effective prosody features. The introduction of post-processing also provides practical improvements. Furthermore, the contributions of fusing other modalities, such as speech prosody features and ASR-based segmentation scores, are significant and have been confirmed again in this experiment.
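To make the mutual information criterion concrete, the following is a minimal sketch, not the report's algorithm: it only estimates I(C;Y) between a hard cluster assignment C and binary story-boundary labels Y on toy data. The function name `mutual_information` and the toy arrays are assumptions introduced for illustration; the report itself works with high-dimensional continuous features and probabilistic cluster memberships.

```python
import math
from collections import Counter


def mutual_information(clusters, labels):
    """Estimate I(C;Y) in bits from paired samples (hypothetical helper)."""
    n = len(clusters)
    joint = Counter(zip(clusters, labels))  # counts of (cluster, label) pairs
    count_c = Counter(clusters)             # marginal counts per cluster
    count_y = Counter(labels)               # marginal counts per label
    mi = 0.0
    for (c, y), n_cy in joint.items():
        p_cy = n_cy / n
        # p(c,y) / (p(c) p(y)) simplifies to n_cy * n / (count_c * count_y)
        mi += p_cy * math.log2(n_cy * n / (count_c[c] * count_y[y]))
    return mi


# Toy data (assumed): 1 = story boundary, 0 = non-boundary.
labels = [1, 1, 0, 0, 1, 1, 0, 0]
aligned = [0, 0, 1, 1, 0, 0, 1, 1]   # clusters that follow the labels
agnostic = [0, 1, 0, 1, 0, 1, 0, 1]  # clusters independent of the labels

print(mutual_information(aligned, labels))
print(mutual_information(agnostic, labels))
```

In this toy setting the label-aligned clustering keeps all of the label information while the label-agnostic one keeps none, which is the sense in which cue clusters chosen by mutual information maximization are preferred over clusters from conventional, label-blind clustering.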
