Feature Analysis and Evaluation for Automatic Emotion Identification in Speech

Parameter selection is a crucial step in developing a system for identifying emotions in speech. Although there is no consensus on the best features for this task, prosody is generally accepted to carry most of the emotional information. Most work in the field uses prosodic features of some kind, often combined with spectral and voice quality parametrizations; nevertheless, no systematic study comparing these features has been carried out. This paper analyzes the characteristics of features derived from prosody, the spectral envelope, and voice quality, as well as their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining the different information sources are evaluated. The results of this analysis are validated with automatic emotion identification experiments. They suggest that spectral envelope features outperform prosodic ones, and that the late fusion of long-term spectral statistics with short-term spectral envelope parameters alone yields an accuracy comparable to that obtained when all parametrizations are combined.
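To make the two combination strategies concrete, the sketch below contrasts early fusion (feature-level concatenation into a single classifier) with late fusion (score-level combination of per-stream classifiers) for two hypothetical feature streams. It is a minimal sketch assuming scikit-learn; the random data, feature dimensions, and SVM classifier are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of early vs. late fusion for emotion classification.
# The feature streams, dimensions, and classifier are illustrative
# assumptions, not the configuration used in the paper.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical per-utterance feature matrices (one row per utterance),
# e.g., long-term statistics (mean, std, ...) of frame-level parameters.
prosodic = rng.random((100, 20))   # e.g., F0 and energy statistics
spectral = rng.random((100, 40))   # e.g., MFCC statistics
labels = rng.integers(0, 7, 100)   # e.g., seven emotion classes

# Early fusion: concatenate the streams and train a single classifier.
early = SVC(probability=True).fit(np.hstack([prosodic, spectral]), labels)

# Late fusion: train one classifier per stream, then combine their
# posterior scores, here with the simple average (sum) rule.
clf_p = SVC(probability=True).fit(prosodic, labels)
clf_s = SVC(probability=True).fit(spectral, labels)

def late_fusion_predict(x_p, x_s):
    """Average the per-stream class posteriors and pick the best class."""
    scores = (clf_p.predict_proba(x_p) + clf_s.predict_proba(x_s)) / 2
    return clf_p.classes_[scores.argmax(axis=1)]
```

Late fusion lets each stream keep its own classifier and, potentially, its own time scale, which is consistent with the result reported above: combining a long-term statistics model with a short-term spectral envelope model can match the accuracy of fusing every parametrization at once.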
