Detecting Categories in News Video Using Acoustic, Speech, and Image Features

This work describes systems for detecting semantic categories present in news video. The multimedia data was processed in three ways: the audio signal was converted to a sequence of acoustic features, automatic speech recognition provided a word-level transcription, and image features were computed for selected frames of the video signal. Primary acoustic, speech, and vision systems were trained to discriminate instances of the categories. Higher-level systems exploited correlations among the categories, incorporated sequential context, and combined the joint evidence from the three information sources. We present experimental results from the TREC Video Retrieval Evaluation (TRECVID).
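As a rough illustration of combining evidence from the three information sources, the sketch below fuses per-shot scores from hypothetical acoustic, speech, and image detectors for a single category and ranks the shots by the fused score. It is not the method used in this work: the abstract does not specify the combination scheme, so z-score normalization followed by a weighted average is an assumption, and the function names (`zscore`, `late_fusion`, `average_precision`) and the toy scores are invented for the example. Average precision is computed in its plain, fully judged form.

```python
from typing import Dict, List, Optional


def zscore(scores: List[float]) -> List[float]:
    """Normalize one detector's scores to zero mean and unit variance."""
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # a constant detector carries no ranking evidence
    return [(s - mean) / std for s in scores]


def late_fusion(
    modality_scores: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of z-normalized scores (assumed fusion rule)."""
    names = list(modality_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total_weight = sum(weights[name] for name in names)
    normalized = {name: zscore(modality_scores[name]) for name in names}
    n_shots = len(next(iter(modality_scores.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in names) / total_weight
        for i in range(n_shots)
    ]


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Average precision of the ranking induced by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy scores for four shots and one category (made-up numbers).
toy_scores = {
    "acoustic": [0.9, 0.2, 0.4, 0.1],
    "speech": [0.8, 0.1, 0.3, 0.7],
    "image": [0.6, 0.3, 0.9, 0.2],
}
toy_labels = [1, 0, 1, 0]  # 1 = category present in the shot

fused = late_fusion(toy_scores)
ap_fused = average_precision(fused, toy_labels)
ap_speech = average_precision(toy_scores["speech"], toy_labels)
```

On this toy data the speech detector alone ranks one negative shot above a positive one, while the fused ranking places both positive shots first. Normalizing before averaging keeps a detector with a wide score range from dominating the combination; learned weights or a second-stage classifier would be natural alternatives.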
