Discrimination between singing and speaking voices

We discuss how to discriminate between singing and speaking voices by using the local and global characteristics of voice signals. Subjective experiments show that human listeners can discriminate singing from speaking with more than 70% accuracy for 300-ms signals and more than 95% accuracy for one-second signals. Based on these results, and assuming that different features are effective for short-term and long-term signals, we designed two measures: one using the spectral envelope (MFCC) and one using the fundamental frequency (F0, perceived as pitch) contour. Experimental results show that the F0 measure outperforms the spectral envelope measure when the input signals are longer than one second; in particular, it discriminates singing from speaking with more than 80% accuracy for two-second signals. Conversely, when the input signals are shorter than one second, the spectral envelope measure performs better than the F0 measure. Finally, by simply combining the two measures, more than 90% accuracy is obtained for two-second signals.
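To make the F0-based idea concrete, the sketch below implements one plausible (hypothetical, not the paper's exact) F0-contour measure: singing tends to hold steady pitches between note changes, while speech F0 drifts continuously, so the fraction of voiced frames with a small frame-to-frame semitone change can serve as a "singing-likeness" score. The `semitone_tol` threshold and the linear combination weight are illustrative assumptions.

```python
import numpy as np

def f0_measure(f0_contour, semitone_tol=0.5):
    """Singing-likeness score in [0, 1] from an F0 contour (Hz per frame,
    0 = unvoiced). Measures the fraction of voiced frame pairs whose F0
    changes by less than `semitone_tol` semitones -- an illustrative
    heuristic, not the measure defined in the paper."""
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = f0 > 0
    # Convert voiced frames to a log-frequency (semitone) scale.
    semitones = np.full(f0.shape, np.nan)
    semitones[voiced] = 12.0 * np.log2(f0[voiced] / 440.0)
    deltas = np.abs(np.diff(semitones))     # NaN where a frame is unvoiced
    valid = ~np.isnan(deltas)
    if not valid.any():
        return 0.0
    return float(np.mean(deltas[valid] < semitone_tol))

def combine(spectral_score, f0_score, weight=0.5):
    """Simple weighted combination of the two per-measure scores in [0, 1],
    standing in for the paper's combination step."""
    return weight * spectral_score + (1.0 - weight) * f0_score
```

For example, a contour of two held notes (sung-like) scores near 1, while a steadily gliding contour (speech-like) scores near 0; the combined score then just blends this with whatever the spectral-envelope measure returns.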
