Stress and emotion recognition using log-Gabor filter analysis of speech spectrograms

We present new methods for extracting characteristic features from speech magnitude spectrograms. Two of the presented approaches proved particularly effective for automatic stress and emotion classification. In the first approach, the spectrograms are subdivided into ERB frequency bands and the average energy within each band is calculated. In the second approach, the spectrograms are passed through a bank of 12 log-Gabor filters, and the filter outputs are averaged and passed through an optimal feature-selection procedure based on mutual information criteria. The proposed methods were tested on single vowels, words, and sentences from the SUSAS database with three classes of stress, and on spontaneous speech recordings made by psychologists (ORI) with five emotional classes. Classification based on a Gaussian mixture model achieves correct classification rates of 40%-81% across the SUSAS data sets and 40%-53.4% for the ORI database.
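The two feature-extraction approaches can be sketched as follows. This is a minimal illustration, not the authors' implementation: the filter-bank layout (3 scales x 4 orientations giving 12 filters), the centre frequencies, the sigma/f0 ratio, and the number of ERB bands are all assumed parameters, and the subsequent mutual-information feature selection and GMM classification are omitted.

```python
# Hedged sketch of the two approaches described above.
# All numeric parameters are illustrative assumptions.
import numpy as np


def erb_band_energies(spec, sr, n_bands=10):
    """Approach 1: mean energy per ERB-spaced frequency band.

    Band edges are placed uniformly on the Moore & Glasberg ERB-rate scale.
    `spec` is a magnitude spectrogram of shape (n_freq_bins, n_frames).
    """
    n_bins = spec.shape[0]
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    erb_rate = 21.4 * np.log10(4.37 * freqs / 1000.0 + 1.0)  # Hz -> ERB-rate
    edges = np.linspace(erb_rate[0], erb_rate[-1], n_bands + 1)
    feats = np.empty(n_bands)
    for b in range(n_bands):
        mask = (erb_rate >= edges[b]) & (erb_rate <= edges[b + 1])
        feats[b] = np.mean(spec[mask] ** 2)  # average energy in the band
    return feats


def log_gabor_bank(shape, n_scales=3, n_orients=4, sigma_ratio=0.65):
    """Approach 2: bank of n_scales * n_orients = 12 two-dimensional
    log-Gabor filters defined in the Fourier domain."""
    rows, cols = shape
    u = np.fft.fftfreq(cols)
    v = np.fft.fftfreq(rows)
    U, V = np.meshgrid(u, v)          # shape (rows, cols)
    radius = np.hypot(U, V)
    radius[0, 0] = 1.0                # avoid log(0) at DC; DC zeroed below
    theta = np.arctan2(V, U)
    filters = []
    for s in range(n_scales):
        f0 = 0.25 / (2.0 ** s)        # assumed centre frequency per scale
        for o in range(n_orients):
            angle = o * np.pi / n_orients
            radial = np.exp(-np.log(radius / f0) ** 2
                            / (2.0 * np.log(sigma_ratio) ** 2))
            radial[0, 0] = 0.0        # a log-Gabor filter has no DC component
            # wrapped angular distance keeps the Gaussian centred on `angle`
            dtheta = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-dtheta ** 2
                             / (2.0 * (np.pi / (2 * n_orients)) ** 2))
            filters.append(radial * angular)
    return filters


def log_gabor_features(spec):
    """Filter the spectrogram with the 12-filter bank and average each
    output magnitude, yielding a 12-D feature vector. Mutual-information
    feature selection would follow this step."""
    F = np.fft.fft2(spec)
    bank = log_gabor_bank(spec.shape)
    return np.array([np.mean(np.abs(np.fft.ifft2(F * g))) for g in bank])
```

Averaging each filter output collapses the 2-D response to a single scalar per filter, which keeps the feature vector low-dimensional before selection and classification.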
