Formant position based weighted spectral features for emotion recognition

In this paper, we propose novel spectrally weighted mel-frequency cepstral coefficient (WMFCC) features for emotion recognition from speech. The idea rests on the fact that formant locations carry emotion-related information, so critical spectral bands around formant locations can be emphasized during the calculation of MFCC features. The spectral weighting is derived from the normalized inverse harmonic mean function of the line spectral frequency (LSF) features, which are known to be localized around formant frequencies. This approach can be considered an early data fusion of spectral content and formant location information. We also investigate methods for late decision fusion of unimodal classifiers. We evaluate the proposed WMFCC features together with standard spectral and prosody features using HMM-based classifiers on the spontaneous FAU Aibo emotional speech corpus. The results show that unimodal classifiers with the WMFCC features perform significantly better than classifiers with standard spectral features, and late decision fusion of classifiers provides further significant performance improvements.
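To make the weighting scheme concrete, the sketch below derives line spectral frequencies from a frame's LPC coefficients, computes normalized inverse harmonic mean weights (in the spirit of Laroia et al.'s LSP quantization weighting), and interpolates them onto FFT bins so that formant regions of the power spectrum are emphasized before mel filtering. This is a minimal illustration under our own assumptions, not the authors' implementation: the helper names (lsf_from_lpc, ihm_weights, wmfcc_spectral_weight) are hypothetical, and the LSF construction follows the standard sum/difference polynomial method from Itakura's line spectrum representation.

```python
import numpy as np

def lsf_from_lpc(a):
    """Line spectral frequencies from LPC coefficients a = [1, a1, ..., ap].

    Standard construction: the roots of the sum polynomial
    P(z) = A(z) + z^-(p+1) A(1/z) and the difference polynomial
    Q(z) = A(z) - z^-(p+1) A(1/z) interleave on the unit circle.
    """
    a_ext = np.concatenate([np.asarray(a, dtype=float), [0.0]])
    P = a_ext + a_ext[::-1]  # sum polynomial (palindromic)
    Q = a_ext - a_ext[::-1]  # difference polynomial (antipalindromic)
    angles = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    # Keep the p roots strictly inside (0, pi); this discards the trivial
    # roots at z = +/-1 and one member of each conjugate pair.
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])

def ihm_weights(lsf):
    """Normalized inverse harmonic mean weights of the LSFs.

    Closely spaced LSFs, which cluster around formants, receive large
    weights; boundary LSFs are measured against 0 and pi.
    """
    ext = np.concatenate([[0.0], lsf, [np.pi]])
    w = 1.0 / (ext[1:-1] - ext[:-2]) + 1.0 / (ext[2:] - ext[1:-1])
    return w / w.sum()

def wmfcc_spectral_weight(lsf, n_fft):
    """Piecewise-linear weighting over FFT bins, peaking near the LSFs."""
    bin_freqs = np.linspace(0.0, np.pi, n_fft // 2 + 1)
    return np.interp(bin_freqs, lsf, ihm_weights(lsf))

# Usage on one analysis frame (the LPC front end is an assumption;
# coefficients could come from, e.g., librosa.lpc(frame, order=10)):
# a = librosa.lpc(frame, order=10)
# power_spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
# weighted_spec = power_spec * wmfcc_spectral_weight(lsf_from_lpc(a), n_fft)
# ...then apply the mel filterbank, log, and DCT as in standard MFCCs.
```

Normalizing the weights to sum to one is our own choice here; in practice the weight curve would be rescaled (and possibly smoothed) before multiplying the power spectrum, and the rest of the MFCC pipeline is left unchanged.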
