Emotional speech recognition: Resources, features, and methods

In this paper, we overview emotional speech recognition with three goals in mind. The first goal is to provide an up-to-date record of the available emotional speech data collections; for each collection, the number of emotional states, the language, the number of speakers, and the kind of speech are briefly addressed. The second goal is to present the acoustic features most frequently used for emotional speech recognition and to assess how emotion affects them. Typical features are the pitch, the formants, the vocal tract cross-section areas, the mel-frequency cepstral coefficients, Teager energy operator-based features, the intensity of the speech signal, and the speech rate. The third goal is to review techniques suitable for classifying speech into emotional states. We examine separately classification techniques that exploit timing information from those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear discriminant analysis, k-nearest neighbors, and support vector machines are reviewed.
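As a concrete illustration of one of the feature families above, the discrete Teager energy operator is defined as Ψ[x[n]] = x[n]² − x[n−1]·x[n+1]; for a pure tone A·cos(ωn) it yields the constant A²·sin²(ω), which couples amplitude and frequency in a single measure. The following minimal NumPy sketch (the signal parameters are illustrative, not taken from the paper) computes the operator and verifies this identity:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample (the first and last samples
    have no two-sided neighborhood and are dropped).
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n), the operator is exactly A^2 * sin(w)^2:
# a constant that jointly reflects amplitude and frequency.
A, w = 0.5, 0.2          # hypothetical amplitude and frequency (rad/sample)
n = np.arange(1000)
tone = A * np.cos(w * n)
psi = teager_energy(tone)
```

In stress and emotion classification, statistics of `psi` computed per frequency band (rather than on the raw waveform) are what typically serve as features.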

[1]  Simon Haykin,et al.  Neural Networks: A Comprehensive Foundation , 1998 .

[2]  Shrikanth S. Narayanan,et al.  Toward detecting emotions in spoken dialogs , 2005, IEEE Transactions on Speech and Audio Processing.

[3]  L. Rothkrantz,et al.  Toward an affect-sensitive multimodal human-computer interaction , 2003, Proc. IEEE.

[4]  Elmar Nöth,et al.  Using speech and gesture to explore user states in multimodal dialogue systems , 2003, AVSP.

[5]  Shrikanth S. Narayanan,et al.  Expressive speech synthesis using a concatenative synthesizer , 2002, INTERSPEECH.

[6]  Alex Waibel,et al.  Detecting Emotions in Speech , 1998 .

[7]  M. Sondhi,et al.  New methods of pitch extraction , 1968 .

[8]  Florian Schiel,et al.  The SmartKom Multimodal Corpus at BAS , 2002, LREC.

[9]  Juan Carlos,et al.  Review of "Discrete-Time Speech Signal Processing - Principles and Practice", by Thomas Quatieri, Prentice-Hall, 2001 , 2003 .

[10]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[11]  Dik J. Hermes,et al.  Expression of emotion and attitude through temporal speech variations , 2000, INTERSPEECH.

[12]  Klaus R. Scherer,et al.  Vocal communication of emotion: A review of research paradigms , 2003, Speech Commun..

[13]  Jiucang Hao,et al.  Emotion recognition by speech signals , 2003, INTERSPEECH.

[14]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[15]  Sadaoki Furui,et al.  Advances in Speech Signal Processing , 1991 .

[16]  J. Montero,et al.  ANALYSIS AND MODELLING OF EMOTIONAL SPEECH IN SPANISH , 1999 .

[17]  D. Mitchell Wilkes,et al.  Acoustical properties of speech as indicators of depression and suicidal risk , 2000, IEEE Transactions on Biomedical Engineering.

[18]  Isabel Trancoso,et al.  Spoken Language Corpora for Speech Recognition and Synthesis in European Portuguese , 1998 .

[19]  Mark Huckvale,et al.  Improvements in Speech Synthesis , 2001 .

[20]  John H. L. Hansen,et al.  Nonlinear feature based classification of speech under stress , 2001, IEEE Trans. Speech Audio Process..

[21]  R. Buck,et al.  The biological affects: a typology. , 1999, Psychological review.

[22]  Nick Campbell,et al.  A Speech Synthesis System with Emotion for Assisting Communication , 2000 .

[23]  Ryohei Nakatsu,et al.  Emotion recognition and its application to computer agents with spontaneous interactive capabilities , 1999, MULTIMEDIA '99.

[24]  J. Flanagan Speech Analysis, Synthesis and Perception , 1971 .

[25]  Johannes Wagner,et al.  From Physiological Signals to Emotions: Implementing and Comparing Selected Methods for Feature Extraction and Classification , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[26]  Jiahong Yuan,et al.  The acoustic realization of anger, fear, joy and sadness in Chinese , 2002, INTERSPEECH.

[27]  Chung-Hsien Wu,et al.  Emotion recognition from textual input using an emotional semantic network , 2002, INTERSPEECH.

[28]  Alex Waibel,et al.  EMOTION-SENSITIVE HUMAN-COMPUTER INTERFACES , 2000 .

[29]  W. Sendlmeier,et al.  Verification of acoustical correlates of emotional speech using formant-synthesis , 2000 .

[30]  Andreas Stolcke,et al.  Prosody-based automatic detection of annoyance and frustration in human-computer dialog , 2002, INTERSPEECH.

[31]  Constantine Kotropoulos,et al.  Automatic speech classification to five emotional states based on gender information , 2004, 2004 12th European Signal Processing Conference.

[32]  Mike Edgington,et al.  Investigating the limitations of concatenative synthesis , 1997, EUROSPEECH.

[33]  K. Scherer,et al.  Effect of experimentally induced stress on vocal parameters. , 1986, Journal of experimental psychology. Human perception and performance.

[34]  Ioannis Pitas,et al.  Automatic emotional speech classification , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[35]  John L. Arnott,et al.  Synthesizing emotions in speech: is it time to get excited? , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[36]  Mari Ostendorf,et al.  TOBI: a standard for labeling English prosody , 1992, ICSLP.

[37]  George N. Votsis,et al.  Emotion recognition in human-computer interaction , 2001, IEEE Signal Process. Mag..

[38]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[39]  Åsa Abelin,et al.  Cross linguistic interpretation of emotional prosody , 2002 .

[40]  John H. L. Hansen,et al.  Frequency band analysis for stress detection using a teager energy operator based feature , 2002, INTERSPEECH.

[41]  Lianhong Cai,et al.  Speech emotion classification with the combination of statistic features and temporal features , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[42]  Klaus R. Scherer,et al.  The role of intonation in emotional expressions , 2005, Speech Commun..

[43]  Mohamad Mrayati,et al.  Distinctive regions and modes: a new theory of speech production , 1988, Speech Commun..

[44]  I. Linnankoski,et al.  Expression or emotional-motivational connotations with a one-word utterance. , 1997, The Journal of the Acoustical Society of America.

[45]  Roddy Cowie,et al.  Automatic recognition of emotion from voice: a rough benchmark , 2000 .

[46]  Oh-Wook Kwon,et al.  EMOTION RECOGNITION BY SPEECH SIGNAL , 2003 .

[47]  Chloé Clavel,et al.  Fiction database for emotion detection in abnormal situations , 2004, INTERSPEECH.

[48]  Klaus R. Scherer,et al.  A cross-cultural investigation of emotion inferences from voice and speech: implications for speech technology , 2000, INTERSPEECH.

[49]  Frank Dellaert,et al.  Recognizing emotion in speech , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[50]  Shubha Kadambe,et al.  Application of the wavelet transform for pitch detection of speech signals , 1992, IEEE Trans. Inf. Theory.

[51]  John H. L. Hansen,et al.  Text-directed speech enhancement using phoneme classification and feature map constrained vector quantization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[52]  John H. L. Hansen,et al.  Speech under stress conditions: overview of the effect on speech production and on system performance , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[53]  Wolfgang J. Hess,et al.  Pitch and voicing determination , 1992 .

[54]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2003, ICTAI.

[55]  A. B.,et al.  SPEECH COMMUNICATION , 2001 .

[56]  Valery A. Petrushin,et al.  RUSLANA: a database of Russian emotional utterances , 2002, INTERSPEECH.

[57]  Malcolm Slaney,et al.  BabyEars: A recognition system for affective vocalizations , 2003, Speech Commun..

[58]  John H. L. Hansen,et al.  N-channel hidden Markov models for combined stressed speech classification and recognition , 1999, IEEE Trans. Speech Audio Process..

[59]  Tomoki Toda,et al.  GMM-based voice conversion applied to emotional speech synthesis , 2003, INTERSPEECH.

[60]  Jennifer Healey,et al.  Toward Machine Emotional Intelligence: Analysis of Affective Physiological State , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[61]  Simon Haykin,et al.  Neural Networks: A Comprehensive Foundation (3rd Edition) , 2007 .

[62]  T. Subba Rao,et al.  Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB , 2004 .

[63]  A. Friederici,et al.  Accentuation and emotions - two different systems? , 2000 .

[64]  Constantine Kotropoulos,et al.  Emotional Speech Classification Using Gaussian Mixture Models and the Sequential Floating Forward Selection Algorithm , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[65]  Björn W. Schuller,et al.  Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[66]  M. Alpert,et al.  Reflections of depression in acoustic measures of the patient's speech. , 2001, Journal of affective disorders.

[67]  Klaus R. Scherer,et al.  Acoustic correlates of task load and stress , 2002, INTERSPEECH.

[68]  Zhigang Deng,et al.  An acoustic study of emotions expressed in speech , 2004, INTERSPEECH.

[69]  Frederick Jelinek,et al.  Continuous speech recognition , 1977, SGAR.

[70]  H. M. Teager,et al.  Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract , 1990 .

[71]  John H. L. Hansen,et al.  Discrete-Time Processing of Speech Signals , 1993 .

[72]  John H. L. Hansen,et al.  Classification of speech under stress using target driven features , 1996, Speech Commun..

[73]  Barbara Heuft,et al.  Emotions in time domain synthesis , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[74]  Roddy Cowie,et al.  Automatic statistical analysis of the signal and prosodic signs of emotion in speech , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[75]  P. Ekman An argument for basic emotions , 1992 .

[76]  Sjl Mozziconacci,et al.  A study of intonation patterns in speech expressing emotion or attitude: production and perception , 1997 .

[77]  J. Nazuno Haykin, Simon. Neural networks: A comprehensive foundation, Prentice Hall, Inc. Segunda Edición, 1999 , 2000 .

[78]  Synnöve Carlson,et al.  Conveyance of emotional connotations by a single word in English , 2005, Speech Commun..

[79]  I. Iriondo,et al.  VALIDATION OF AN ACOUSTICAL MODELLING OF EMOTIONAL EXPRESSION IN SPANISH USING SPEECH SYNTHESIS TECHNIQUES , 2000 .

[80]  Marc Schröder,et al.  Experimental study of affect bursts , 2003, Speech Commun..

[81]  John H. L. Hansen,et al.  Nonlinear analysis and classification of speech under stressed conditions , 1994 .

[82]  Björn Granström,et al.  Measurements of articulatory variation in expressive speech for a set of Swedish vowels , 2004, Speech Commun..

[83]  Andreas Christmann,et al.  Support vector machines , 2008, Data Mining and Knowledge Discovery Handbook.

[84]  Harry Shum,et al.  Emotion Detection from Speech to Enrich Multimedia Content , 2001, IEEE Pacific Rim Conference on Multimedia.

[85]  Chong-Kwan Un,et al.  On Predictive Coding of Speech Signals , 1985 .

[86]  K. Scherer,et al.  Vocal cues in emotion encoding and decoding , 1991 .

[87]  A. Lloyd,et al.  Comprehension of Prosody in Parkinson's Disease , 1999, Cortex.

[88]  Roddy Cowie,et al.  Emotional speech: Towards a new generation of databases , 2003, Speech Commun..

[89]  T.H. Crystal,et al.  Linear prediction of speech , 1977, Proceedings of the IEEE.

[90]  Carlo Drioli,et al.  Modifications of phonetic labial targets in emotive speech: effects of the co-production of speech and emotions , 2004, Speech Commun..

[91]  Peter Geach,et al.  Descartes. Philosophical Writings. , 1957 .

[92]  Petros Maragos,et al.  A system for finding speech formants and modulations via energy separation , 1994, IEEE Trans. Speech Audio Process..

[93]  Albino Nogueiras,et al.  Speech emotion recognition using hidden Markov models , 2001, INTERSPEECH.

[94]  Keinosuke Fukunaga,et al.  Introduction to Statistical Pattern Recognition , 1972 .

[95]  Elmar Nöth,et al.  “You Stupid Tin Box” - Children Interacting with the AIBO Robot: A Cross-linguistic Emotional Speech Corpus , 2004, LREC.

[96]  Valery A. Petrushin,et al.  EMOTION IN SPEECH: RECOGNITION AND APPLICATION TO CALL CENTERS , 1999 .

[97]  Ralf Kompe,et al.  Emotional space improves emotion recognition , 2002, INTERSPEECH.

[98]  Masahiro Araki,et al.  Synthesis of emotional speech using prosodically balanced VCV segments , 2001, SSW.

[99]  K. Scherer,et al.  Acoustic profiles in vocal emotion expression. , 1996, Journal of personality and social psychology.

[100]  R. Stibbard AUTOMATED EXTRACTION OF ToBI ANNOTATION DATA FROM THE READING / LEEDS EMOTIONAL SPEECH CORPUS , 2000 .

[101]  M. Landau Acoustical Properties of Speech as Indicators of Depression and Suicidal Risk , 2008 .

[102]  D Cairns,et al.  NONLINEAR ANALYSIS AND DETECTION OF SPEECH UNDER STRESSED CONDITIONS , 1994 .

[103]  Cecile Pereira DIMENSIONS OF EMOTIONAL MEANING IN SPEECH , 2000 .

[104]  Rosalind W. Picard,et al.  Modeling drivers' speech under stress , 2003, Speech Commun..

[105]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[106]  P. Mermelstein Automatic segmentation of speech into syllabic units. , 1975, The Journal of the Acoustical Society of America.

[107]  Nick Campbell,et al.  A corpus-based speech synthesis system with emotion , 2003, Speech Commun..

[108]  John H. L. Hansen,et al.  HMM-based stressed speech modeling with application to improved synthesis and recognition of isolated speech under stress , 1998, IEEE Trans. Speech Audio Process..

[109]  John H. L. Hansen,et al.  ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments , 1995, Speech Commun..

[110]  Kohji Fukunaga,et al.  Introduction to Statistical Pattern Recognition-Second Edition , 1990 .

[111]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[112]  H. Akaike A new look at the statistical model identification , 1974 .

[113]  Roddy Cowie,et al.  Describing the emotional states that are expressed in speech , 2003, Speech Commun..

[114]  Shrikanth S. Narayanan,et al.  Reference marking in children's computer-directed speech: an integrated analysis of discourse and gestures , 2004, INTERSPEECH.

[115]  N. Amir,et al.  Analysis of an emotional speech corpus in Hebrew based on objective criteria , 2000 .

[116]  Marc Schröder,et al.  Expressing vocal effort in concatenative synthesis , 2003 .