A Multimodal Scheme for Program Segmentation and Representation in Broadcast Video Streams

With the advance of digital video recording and playback systems, there is a clear need to manage recorded TV programs efficiently so that users can readily locate and browse their favorite programs. In this paper, we propose a multimodal scheme to segment and represent TV video streams. The scheme aims to recover the temporal and structural characteristics of TV programs from visual, auditory, and textual information. For visual cues, we develop a novel concept named program-oriented informative images (POIM) to identify candidate points correlated with the boundaries of individual programs. For audio cues, we propose a multiscale Kullback-Leibler (K-L) distance to locate audio scene changes (ASC), which are then aligned with video scene changes to represent candidate program boundaries. In addition, latent semantic analysis (LSA) is adopted to calculate the textual content similarity (TCS) between shots, modeling the intra-program similarity and inter-program dissimilarity of speech content. Finally, we fuse the multimodal features of POIM, ASC, and TCS to detect the boundaries of programs, including individual commercials (spots). To support an effective program guide and engaging content browsing, we propose a multimodal representation that summarizes each program with POIM images, key frames, and textual keywords. Extensive experiments are carried out on the open TRECVID 2005 benchmark corpus, and promising results have been achieved. Compared with the electronic program guide (EPG), our solution provides a more generic approach to determining the exact boundaries of diverse TV programs, including dramatic spots.
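As a rough illustration of the audio cue, the sketch below scores a candidate audio scene change by comparing the feature windows on either side of a time index at several window sizes. It is a minimal sketch under stated assumptions, not the formulation used in this paper: the abstract does not give the exact definition, so the univariate Gaussian model, the symmetric form of the K-L distance, the averaging across scales, and the names `gaussian_kl`, `symmetric_kl`, `multiscale_kl`, and `scales` are all hypothetical choices made here for illustration.

```python
import math
from statistics import fmean, pvariance


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two univariate Gaussians (closed form)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def symmetric_kl(left, right, eps=1e-6):
    """Symmetric K-L distance between Gaussians fitted to two windows.

    eps is an assumed variance floor that avoids division by zero
    on constant windows.
    """
    mu_l, mu_r = fmean(left), fmean(right)
    var_l = pvariance(left, mu_l) + eps
    var_r = pvariance(right, mu_r) + eps
    return (gaussian_kl(mu_l, var_l, mu_r, var_r)
            + gaussian_kl(mu_r, var_r, mu_l, var_l))


def multiscale_kl(features, t, scales=(4, 8, 16)):
    """Average the symmetric K-L distance at index t over several window sizes.

    features: 1-D sequence of a per-frame audio feature (assumed input).
    Scales whose windows do not fit inside the sequence are skipped.
    """
    scores = []
    for w in scales:
        if t - w < 0 or t + w > len(features):
            continue
        scores.append(symmetric_kl(features[t - w:t], features[t:t + w]))
    return fmean(scores) if scores else 0.0


if __name__ == "__main__":
    # Synthetic feature track with an abrupt change at index 32.
    track = [0.0, 0.1] * 16 + [5.0, 5.1] * 16
    for t in (16, 32, 48):
        print(t, multiscale_kl(track, t))
```

On the synthetic track, the score at the change point (index 32) is much larger than at points inside a homogeneous segment, which is the behavior a boundary detector would threshold or peak-pick on. Real audio features are multidimensional, so a practical version would need a multivariate model.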
