Emotional Vocal Expressions Recognition Using the COST 2102 Italian Database of Emotional Speech

This paper proposes a new speaker-independent approach to the classification of emotional vocal expressions, using the COST 2102 Italian database of emotional speech. The audio recordings, extracted from video clips of Italian movies, possess a certain degree of spontaneity and are either noisy or slightly degraded by interruptions, which makes the collected stimuli more realistic than those in available emotional databases containing utterances recorded under studio conditions. The audio stimuli represent six basic emotional states: happiness, sarcasm/irony, fear, anger, surprise, and sadness. Under these more realistic conditions, and with a speaker-independent approach, the proposed system classifies the emotions under examination with 60.7% accuracy. It uses a hierarchical structure consisting of a Perceptron and fifteen Gaussian Mixture Models (GMMs), each trained to discriminate between the two emotions of one pair. The most discriminative features were selected with the Sequential Floating Forward Selection (SFFS) algorithm from a large set of spectral, prosodic and voice quality features. The results were compared with the subjective evaluation of the stimuli provided by human subjects.
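The pairwise arrangement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the pairwise outputs are combined, so simple majority voting stands in for the Perceptron stage, and a single one-dimensional Gaussian per emotion stands in for each GMM. The names `PAIRS`, `pair_models`, and `classify`, and all numeric parameters, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

EMOTIONS = ["happiness", "sarcasm/irony", "fear", "anger", "surprise", "sadness"]

# One binary model per unordered pair of emotions: C(6, 2) = 15 pairs.
PAIRS = list(combinations(EMOTIONS, 2))


def gaussian_log_likelihood(x, mean, var):
    """Log-density of a 1-D Gaussian (a stand-in for a full GMM)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def classify(x, pair_models):
    """Let every pairwise model vote, then return the most-voted emotion.

    Majority voting is an assumption made for this sketch; the paper
    combines the pairwise models in a hierarchy that includes a Perceptron.
    """
    votes = Counter()
    for a, b in PAIRS:
        params = pair_models[(a, b)]
        ll_a = gaussian_log_likelihood(x, *params[a])
        ll_b = gaussian_log_likelihood(x, *params[b])
        votes[a if ll_a >= ll_b else b] += 1
    return votes.most_common(1)[0][0]


# Toy parameters (hypothetical): emotion i has mean i and unit variance.
toy_params = {emotion: (float(i), 1.0) for i, emotion in enumerate(EMOTIONS)}
pair_models = {(a, b): {a: toy_params[a], b: toy_params[b]} for a, b in PAIRS}

print(len(PAIRS))                   # 15
print(classify(3.1, pair_models))   # emotion whose toy mean is closest to 3.1
```

In a real system each entry of `pair_models` would hold models trained on the features selected for that specific pair, which is what makes the pairwise decomposition more than a plain maximum-likelihood decision.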
