A Speech/Music Discriminator of Radio Recordings Based on Dynamic Programming and Bayesian Networks

This paper presents a multistage system for speech/music discrimination built on a three-step procedure. The first step is a computationally efficient region growing technique that operates on a 1-D feature sequence extracted from the raw audio stream. It serves as a preprocessing stage and yields segments with high speech and music precision, at the expense of leaving certain parts of the audio recording unclassified. The unclassified parts are then fed to a second, more computationally demanding scheme, which treats speech/music discrimination of radio recordings as a probabilistic segmentation task solved by means of dynamic programming. This scheme seeks the sequence of segments and respective class labels (i.e., speech/music) that maximizes the product of posterior class probabilities, given the data that form the segments. To this end, a Bayesian Network combiner is embedded as a posterior probability estimator. In the final step, a boundary correction algorithm is applied to remove possible errors at the boundaries of the previously generated speech and music segments. The proposed system has been tested on radio recordings from various sources, and the overall system accuracy is approximately 96%. Performance results are also reported on a musical genre basis, and a comparison with existing methods is given.
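The dynamic programming step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_len`/`max_len` parameters, and the toy posterior (an average of hypothetical per-frame speech probabilities standing in for the Bayesian Network combiner) are all assumptions made for illustration. The sketch maximizes the sum of log posteriors, which is equivalent to maximizing their product.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start frame, end frame exclusive, label)


def segment_by_dp(
    n_frames: int,
    posterior: Callable[[int, int, str], float],
    labels: Sequence[str] = ("speech", "music"),
    min_len: int = 1,
    max_len: Optional[int] = None,
) -> List[Segment]:
    """Find the labelled segmentation maximizing the product of posteriors.

    `posterior(start, end, label)` is a hypothetical estimator of
    P(label | frames[start:end]); the paper uses a Bayesian Network
    combiner for this role.
    """
    max_len = max_len or n_frames
    # best[t]: highest log-score of any segmentation of frames[0:t]
    best = [-math.inf] * (n_frames + 1)
    best[0] = 0.0
    back: List[Optional[Tuple[int, str]]] = [None] * (n_frames + 1)

    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_len), end - min_len + 1):
            if best[start] == -math.inf:
                continue  # no valid segmentation ends at `start`
            for label in labels:
                p = posterior(start, end, label)
                if p <= 0.0:
                    continue  # log undefined; treat as impossible
                score = best[start] + math.log(p)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)

    if back[n_frames] is None:
        raise ValueError("no valid segmentation under the given constraints")

    # Backtrack from the last frame to recover the segments.
    segments: List[Segment] = []
    end = n_frames
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return segments


def mean_frame_posterior(p_speech: Sequence[float]) -> Callable[[int, int, str], float]:
    """Toy stand-in estimator: average per-frame speech probability."""
    def posterior(start: int, end: int, label: str) -> float:
        avg = sum(p_speech[start:end]) / (end - start)
        return avg if label == "speech" else 1.0 - avg
    return posterior
```

For example, with hypothetical frame probabilities `[0.9, 0.9, 0.9, 0.1, 0.1, 0.1]`, calling `segment_by_dp(6, mean_frame_posterior(...))` splits the stream at frame 3 into one speech and one music segment. Because every posterior is at most 1, the product criterion naturally penalizes over-segmentation; a practical system would likely also bound segment durations, which is what `min_len` and `max_len` are meant to suggest.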
