Video genre verification using both acoustic and visual modes

This paper reports on video genre verification for five genres: sport, cartoon, news, commercial and music. Results for the acoustic mode, the visual mode and the two modes combined show average equal error rates (EER) of 16%, 15% and 10%, respectively. Because these figures measure verification accuracy, they are believed to be the first of their kind; previously published work has focused on closed-set identification, which assumes that the video is known to belong to one of a fixed set of genres. The results also show that performance depends on the genre being verified: the best performance for the visual mode is an EER of 4% (cartoons), and the best performance for the acoustic mode is an EER of 0.6% (news). Finally, combining the modes gives more consistent accuracy across the five genres, with an average EER of 10%.
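For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (impostor videos accepted as the claimed genre) equals the false rejection rate (genuine videos rejected). The following is a minimal sketch of how an EER can be estimated from verification scores. It is not the authors' evaluation code; the function name `equal_error_rate`, the threshold sweep and the toy scores are all illustrative assumptions.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER from two lists of verification scores.

    Assumption (not from the paper): a higher score means stronger
    support for the claimed genre, and a trial is accepted when its
    score is greater than or equal to the threshold.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every distinct observed score.
    thresholds = sorted(set(target_scores) | set(impostor_scores))

    best_gap = None
    eer = None
    for t in thresholds:
        # False rejection rate: genuine trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scoring at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Hypothetical scores for the claim "this clip is news".
genuine = [0.9, 0.8, 0.3, 0.7]    # clips that really are news
impostor = [0.1, 0.2, 0.75, 0.4]  # clips from other genres
print(equal_error_rate(genuine, impostor))
```

With a finite set of trials the two error rates rarely coincide exactly, so this sketch reports their mean at the threshold where they are closest. Unlike closed-set identification, each trial here is an accept or reject decision on a single claimed genre.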
