Speech and music classification in audio documents

To index the soundtrack of multimedia documents efficiently, it is necessary to extract elementary and homogeneous acoustic segments. In this paper, we explore such a prior partitioning, which consists of detecting the two basic components: speech and music. The originality of this work is that music and speech are not treated as two competing classes; instead, two classification systems are defined independently, a speech/non-speech one and a music/non-music one. This approach makes it possible to better characterize and discriminate each component: in particular, two different feature spaces are necessary, as are two pairs of Gaussian mixture models. Moreover, the acoustic signal is divided into four types of segments: speech, music, speech-music and other. The experiments are performed on the soundtracks of audio-video documents (films, TV sport broadcasts). The results demonstrate the value of this approach, called the Differentiated Modeling Approach.
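The way two independent detectors yield four segment types can be sketched as follows. This is a minimal illustration, not the authors' implementation: the diagonal-covariance form of the mixtures, the zero-threshold log-likelihood-ratio decision summed over frames, and all names (`gmm_loglik`, `detect`, `label_segment`, the `models` dictionary keys) are assumptions made for the example. The abstract states only that each detector has its own feature space and its own pair of Gaussian mixture models.

```python
import math

def gmm_loglik(x, gmm):
    """Log-likelihood of one feature vector x under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) tuples (assumed layout).
    """
    comps = []
    for weight, means, variances in gmm:
        ll = math.log(weight)
        for xi, m, v in zip(x, means, variances):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        comps.append(ll)
    # Log-sum-exp over components for numerical stability.
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))

def detect(frames, positive_gmm, negative_gmm):
    """Return True if the summed log-likelihood ratio favors the positive model.

    The zero threshold is an assumption of this sketch.
    """
    ratio = sum(gmm_loglik(x, positive_gmm) - gmm_loglik(x, negative_gmm)
                for x in frames)
    return ratio > 0.0

def label_segment(speech_frames, music_frames, models):
    """Combine the two independent binary decisions into one of four labels.

    speech_frames and music_frames come from two different feature spaces,
    one per detector.
    """
    is_speech = detect(speech_frames, models["speech"], models["non_speech"])
    is_music = detect(music_frames, models["music"], models["non_music"])
    if is_speech and is_music:
        return "speech-music"
    if is_speech:
        return "speech"
    if is_music:
        return "music"
    return "other"

# Toy one-dimensional, single-component models (hypothetical parameters).
toy_models = {
    "speech": [(1.0, [1.0], [1.0])],
    "non_speech": [(1.0, [-1.0], [1.0])],
    "music": [(1.0, [1.0], [1.0])],
    "non_music": [(1.0, [-1.0], [1.0])],
}

result = label_segment([[0.9], [1.2]], [[-0.8], [-1.1]], toy_models)
```

Because the two decisions are taken independently, a segment containing speech over background music can be flagged by both detectors, which a single two-class speech-versus-music classifier cannot express.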
