Time-Frequency Feature and AMS-GMM Mask for Acoustic Emotion Classification

In this letter, the pH time-frequency vocal source feature is proposed for multi-style emotion identification. A binary acoustic mask is also applied to improve emotion classification accuracy. Emotional and stress conditions from the Berlin Database of Emotional Speech (EMO-DB) and the Speech Under Simulated and Actual Stress (SUSAS) databases are investigated in the experiments. In terms of emotion identification rates, the pH feature outperforms the mel-frequency cepstral coefficients (MFCC) and a Teager-Energy-Operator (TEO) based feature. Moreover, the acoustic mask improves classification accuracy for both the MFCC and the pH features.
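As a rough illustration only (a sketch under assumed parameters, not the authors' exact implementation), the pH feature can be viewed as a sequence of per-frame Hurst parameter estimates obtained with a wavelet-based (Abry-Veitch style) estimator: the log2 energy of the detail coefficients is regressed against the decomposition scale, and the slope is mapped to H. The frame length, hop size, wavelet choice ('db2'), and number of decomposition levels below are illustrative assumptions.

```python
# Minimal sketch of a per-frame Hurst parameter (pH-style) feature extractor.
# Wavelet, frame length, hop size, and level count are illustrative assumptions.
import numpy as np
import pywt


def frame_hurst(frame, wavelet="db2", levels=5):
    """Estimate the Hurst parameter of one frame from the log2 energy of the
    wavelet detail coefficients across scales (Abry-Veitch style regression)."""
    coeffs = pywt.wavedec(frame, wavelet, level=levels)
    details = coeffs[1:]                          # cD_levels, ..., cD_1
    scales = np.arange(levels, 0, -1)             # scale index j for each band
    log_energy = np.array([np.log2(np.mean(d ** 2) + 1e-12) for d in details])
    slope, _ = np.polyfit(scales, log_energy, 1)  # fit log2 energy vs. scale
    return (slope + 1.0) / 2.0                    # H = (alpha + 1) / 2 for fGn-like frames


def ph_feature(signal, frame_len=512, hop=256):
    """Concatenate per-frame Hurst estimates into a pH-style feature trajectory."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([frame_hurst(f) for f in frames])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16000)                # stand-in for a speech frame sequence
    print(ph_feature(x)[:5])
```

The resulting pH trajectory characterizes the long-range dependence of the vocal source over time and, as in the letter, can be fed to a conventional classifier alongside or in place of MFCC vectors.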
