EMOTIONS: WHAT IS POSSIBLE IN THE ASR FRAMEWORK

This paper discusses the possibilities of extracting features from the speech signal that can be used to detect the emotional state of the speaker within the ASR (automatic speech recognition) framework. After the introduction, a short overview of the ASR framework is presented. Next, we discuss the relation between emotion recognition and ASR, and the different approaches found in the literature for mapping emotions onto acoustic features. The conclusion is that emotion itself will be very difficult to predict with high accuracy, but that within ASR, general prosodic information is potentially powerful for improving (word) accuracy in limited-domain tasks.
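To make the feature-extraction step concrete, the sketch below (Python with NumPy; the function name, frame sizes, pitch range, and voicing threshold are illustrative assumptions, not taken from the paper) computes the kind of utterance-level prosodic statistics, such as pitch and energy mean, variability, and range, that studies of this type typically feed into an emotion classifier.

```python
import numpy as np

def prosodic_features(signal, sr, frame_ms=25, hop_ms=10,
                      f0_min=75.0, f0_max=400.0):
    """Frame-level energy and a simple autocorrelation-based F0 estimate,
    summarized into utterance-level prosodic statistics."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies, f0s = [], []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame].astype(float)
        x = x - x.mean()
        energies.append(float(np.sum(x * x)))
        # Autocorrelation F0 estimate: search lags in the plausible pitch range.
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        if hi >= len(ac) or ac[0] <= 0:
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        # Keep only clearly periodic (voiced) frames.
        if ac[lag] / ac[0] > 0.3:
            f0s.append(sr / lag)
    f0s = np.array(f0s) if f0s else np.array([0.0])
    energies = np.array(energies)
    return {
        "f0_mean": float(f0s.mean()),
        "f0_std": float(f0s.std()),
        "f0_range": float(f0s.max() - f0s.min()),
        "energy_mean": float(energies.mean()),
        "energy_std": float(energies.std()),
    }
```

Such global statistics deliberately discard the word-level timing that an ASR decoder works with; reconciling the two time scales is one reason the paper argues that prosody is easier to exploit for improving word accuracy on a limited domain than for predicting emotion directly.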
