Machine-learning based classification of speech and music

Classifying audio into categories such as speech or music is an important part of many multimedia document retrieval systems. In this paper, we investigate audio features that have not previously been used in music-speech classification, such as the mean and variance of the discrete wavelet transform, the variance of the Mel-frequency cepstral coefficients, the root mean square of a lowpass signal, and the difference between the maximum and minimum zero-crossings. We then apply fuzzy C-means clustering to the problem of selecting a viable set of features that enables better classification accuracy. Three classification frameworks are studied: Multi-Layer Perceptron (MLP) neural networks, Radial Basis Function (RBF) neural networks, and Hidden Markov Models (HMMs). The results of each framework are reported and compared. Our extensive experimentation has identified a subset of features that contributes most to accurate classification and has shown that MLP networks are the most suitable classification framework for the problem at hand.
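As an illustration only, three of the features named above could be computed roughly as in the following sketch. The abstract does not give the frame length, the lowpass filter, the wavelet family, or the decomposition level, so every one of those choices here (non-overlapping 256-sample frames, a moving-average filter, a one-level Haar transform using detail coefficients) is an assumption, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np


def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames, dropping the remainder."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)


def zero_crossing_counts(frames):
    """Count sign changes between consecutive samples in each frame."""
    signs = np.signbit(frames)
    return np.sum(signs[:, 1:] != signs[:, :-1], axis=1)


def zcr_max_min_difference(x, frame_len=256):
    """Largest minus smallest per-frame zero-crossing count (frame_len is assumed)."""
    counts = zero_crossing_counts(frame_signal(x, frame_len))
    return int(counts.max() - counts.min())


def lowpass_rms(x, kernel_len=32):
    """RMS of the signal after a moving-average lowpass (an assumed stand-in filter)."""
    kernel = np.ones(kernel_len) / kernel_len
    smoothed = np.convolve(x, kernel, mode="same")
    return float(np.sqrt(np.mean(smoothed ** 2)))


def haar_dwt_stats(x):
    """Mean and variance of one-level Haar detail coefficients (assumed wavelet)."""
    x = x[: len(x) // 2 * 2]  # Haar pairs samples, so use an even length
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return float(detail.mean()), float(detail.var())


if __name__ == "__main__":
    # Synthetic one-second signal at 8 kHz: a steady tone followed by noise.
    rng = np.random.default_rng(0)
    t = np.arange(4000) / 8000.0
    signal = np.concatenate([np.sin(2 * np.pi * 440 * t), rng.standard_normal(4000)])
    print(zcr_max_min_difference(signal))
    print(lowpass_rms(signal))
    print(haar_dwt_stats(signal))
```

Each function returns plain numbers for one clip, so the values can be stacked into a feature vector for whichever classifier is being compared. The variance of the Mel-frequency cepstral coefficients is left out because it would need a dedicated audio library.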
