Automatic recognition of speech emotion using long-term spectro-temporal features

This paper proposes a novel feature type for the recognition of emotion from speech. The features are derived from a long-term spectro-temporal representation of speech and are compared to short-term spectral features as well as popular prosodic features. Experimental results on the Berlin emotional speech database show that the proposed features outperform both compared feature types: an average recognition accuracy of 88.6% is achieved with a combined proposed-and-prosodic feature set when classifying seven discrete emotions. The proposed features are also evaluated on the VAM corpus for recognizing continuous emotion primitives, where they achieve estimation performance comparable to that of human evaluators.