Speech Emotion Recognition Using Affective Saliency

We investigate an affective saliency approach for speech emotion recognition of spoken dialogue utterances that estimates the amount of emotional information over time. The proposed saliency approach uses a regression model that combines features extracted from the acoustic signal with the posteriors of a segment-level classifier to obtain frame- or segment-level ratings. The affective saliency model is trained using a minimum classification error (MCE) criterion that learns the weights by optimizing an objective loss function related to the classification error rate of the emotion recognition system. Affective saliency scores are then used to weight the contribution of frame-level posteriors and/or features to the speech emotion classification decision. The algorithm is evaluated for the task of anger detection on four call-center datasets in two languages, Greek and English, with good results.
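To make the decision rule concrete, the following is a minimal sketch of saliency-weighted posterior fusion, assuming segment-level emotion posteriors and per-segment acoustic feature vectors are already available. The linear saliency model, the softmax normalization, and all function and variable names are illustrative assumptions, not the authors' exact formulation; in the paper's setup the saliency weights would be learned with the MCE criterion rather than drawn at random.

```python
import numpy as np


def saliency_scores(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Hypothetical linear saliency model: one score per segment from its
    acoustic feature vector, normalized to sum to one over the utterance.
    `weights` stands in for parameters learned with an MCE-style criterion."""
    raw = features @ weights                   # (num_segments,)
    exp = np.exp(raw - raw.max())              # numerically stable softmax
    return exp / exp.sum()


def classify_utterance(posteriors: np.ndarray, saliency: np.ndarray) -> int:
    """Weight each segment's class posteriors by its saliency score and
    pick the class with the highest aggregated score."""
    # posteriors: (num_segments, num_classes); saliency: (num_segments,)
    utterance_scores = saliency @ posteriors   # (num_classes,)
    return int(np.argmax(utterance_scores))


# Toy usage: 3 segments, 2 classes (e.g., neutral vs. anger).
posteriors = np.array([[0.7, 0.3],
                       [0.4, 0.6],
                       [0.2, 0.8]])
features = np.random.randn(3, 5)               # 5 acoustic features per segment
weights = np.random.randn(5)                   # stand-in for MCE-trained weights
print(classify_utterance(posteriors, saliency_scores(features, weights)))
```

Under this reading, segments judged more emotionally salient contribute more to the utterance-level decision, while uniform saliency scores would reduce the rule to plain posterior averaging.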
