Learning to Identify TV News Monologues by Style and Context

We focus on the problem of learning semantics from multimedia data associated with broadcast video documents. In this paper we propose to learn semantic concepts from multimodal sources based on style and context detectors, in combination with statistical classier ensembles. As a case study we present our method for detecting the concept of news subject monologues. This approach had the best average precision performance amongst 26 submissions in the 2003 video track of the Text Retrieval Conference benchmark. Experiments were conducted with respect to individual detector contribution, ensemble size, and ranking mechanism. It was found that the combination of detectors is decisive for the nal result, although some detectors might appear useless in isolation. Moreover, by using a probabilistic ranking, in combination with a large classier ensemble, results can be improved even further.

[1]  Marcel Worring,et al.  Multimodal Video Indexing : A Review of the State-ofthe-art , 2001 .

[2]  Takeo Kanade,et al.  Video OCR: indexing digital news libraries by recognition of superimposed captions , 1999, Multimedia Systems.

[3]  Marcel Worring,et al.  Interactive Adaptive Movie Annotation , 2003, IEEE Multim..

[4]  Thomas S. Huang,et al.  Factor graph framework for semantic video indexing , 2002, IEEE Trans. Circuits Syst. Video Technol..

[5]  Jean-Luc Gauvain,et al.  The LIMSI Broadcast News transcription system , 2002, Speech Commun..

[6]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[7]  Tobun Dorbin Ng,et al.  Informedia at TRECVID 2003 : Analyzing and Searching Broadcast News Video , 2003, TRECVID.

[8]  Svetha Venkatesh,et al.  Toward automatic extraction of expressive elements from motion pictures: tempo , 2002, IEEE Trans. Multim..

[9]  Michael G. Christel,et al.  Enhanced access to digital video through visually rich interfaces , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[10]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[11]  Harriet J. Nock,et al.  Audio-visual synchrony for detection of monologues in video archives , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[12]  Haim H. Permuter,et al.  IBM Research TREC 2002 Video Retrieval System , 2002, TREC.

[13]  John Zimmerman,et al.  A probabilistic layered framework for integrating multimedia content and context information , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14]  David Bordwell,et al.  Film Art: An Introduction , 1979 .

[15]  Takeo Kanade,et al.  Object Detection Using the Statistics of Parts , 2004, International Journal of Computer Vision.

[16]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.

[17]  Anil K. Jain,et al.  Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Takeo Kanade,et al.  Name-It: Naming and Detecting Faces in News Videos , 1999, IEEE Multim..

[19]  Yihong Gong,et al.  Lessons Learned from Building a Terabyte Digital Video Library , 1999, Computer.

[20]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[21]  Georges Quénot,et al.  CLIPS at TREC 11: Experiments in Video Retrieval , 2002, TREC.

[22]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[23]  Roberto Brunelli,et al.  Person identification using multiple cues , 1995, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Marcel Worring,et al.  Multimedia event-based video indexing using time intervals , 2005, IEEE Transactions on Multimedia.