Online Speech/Music Segmentation Based on the Variance Mean of Filter Bank Energy

This paper presents a novel feature for online speech/music segmentation based on the variance mean of filter bank energy (VMFBE). The feature is motivated by the behaviour of the energy in a narrow frequency sub-band: for speech, this energy varies more rapidly and over a wider range than for music, so its variance within such a sub-band is greater for speech than for music. A radio broadcast database and the BNSI broadcast news database were used to evaluate the feature's discrimination and segmentation ability. The VMFBE calculation procedure shares 4 of the 6 steps of the MFCC calculation procedure, which makes it a very convenient speech/music discriminator for real-time automatic speech recognition systems based on MFCC features: valuable processing time is saved and the computational load increases only slightly. Analysis of the feature's speech/music discriminative ability shows an average error rate below 10% on radio broadcast material, outperforming the other features used for comparison by more than 8%. As a stand-alone speech/music discriminator in a segmentation system, the proposed feature achieves an overall accuracy of over 94% on radio broadcast material.
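
The abstract only outlines the computation, so the sketch below is a reconstruction under assumptions rather than the paper's exact recipe: a standard MFCC-style front end (Hamming-windowed frames, power spectrum, a 23-channel mel filter bank) supplies the shared steps, after which the per-channel energy variance is estimated over a sliding window of frames and averaged across channels. The frame sizes, filter count, and variance-window length used here are illustrative values, not the ones from the paper.

```python
import numpy as np

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular mel filter bank (23 channels assumed, as in many MFCC front ends)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def vmfbe(signal, sample_rate=16000, frame_len=400, hop=160,
          n_fft=512, n_filters=23, var_window=50):
    """Sketch of the VMFBE idea: mean, over filter bank channels, of each
    channel's energy variance estimated over a window of consecutive frames."""
    # Steps shared with MFCC: framing, windowing, power spectrum, filter bank energies.
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2              # (n_frames, n_fft//2 + 1)
    fb_energy = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T

    # VMFBE-specific steps: per-channel variance over a block of frames,
    # then the mean of those variances, giving one value per block.
    values = []
    for start in range(0, n_frames - var_window + 1, var_window):
        block = fb_energy[start:start + var_window]              # (var_window, n_filters)
        values.append(block.var(axis=0).mean())                  # mean of per-channel variances
    return np.array(values)
```

On radio-style material, such a value would be expected to come out higher for speech segments than for music segments, so a simple threshold on the VMFBE stream already acts as a speech/music discriminator; the threshold itself would have to be tuned on labelled data.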
