Emotion recognition of conversational affective speech using temporal course modeling-based error weighted cross-correlation model

A complete emotional expression in natural face-to-face conversation typically unfolds over a complex temporal course. In this paper, we propose a temporal course modeling-based error weighted cross-correlation model (TCM-EWCCM) for speech emotion recognition. In TCM-EWCCM, a TCM-based cross-correlation model (CCM) is first used not only to model the temporal evolution of each extracted acoustic and prosodic feature individually but also to capture the statistical dependencies between paired acoustic-prosodic features in different emotional states. A Bayesian classifier weighting scheme, error weighted classifier combination, is then adopted to weight the contributions of the individual TCM-based CCM classifiers built for the different acoustic-prosodic feature pairs, thereby enhancing recognition accuracy. Experimental results on the NCKU-CASC corpus demonstrate that modeling the complex temporal structure of natural conversational speech, and accounting for both the statistical dependencies and the relative contributions of paired features, indeed improves speech emotion recognition performance.
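The abstract does not spell out the combination rule, but the error weighted classifier combination step lends itself to a short illustration. The Python sketch below is a minimal, hypothetical rendering of the idea: each feature-pair classifier's class posterior is scaled by a per-class correctness weight estimated from its validation confusion matrix, and the weighted votes are summed and renormalized. The function names (confusion_weights, ewcc_combine) and the precision-based weighting are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def confusion_weights(conf_mat):
    """Per-class correctness weights from a validation confusion matrix.

    conf_mat[i, j] = count of samples with true class i predicted as class j.
    Returns w[j] ~ P(true == j | predicted == j), i.e. the precision of each
    predicted class, used here as the error weight (an assumed form of EWCC).
    """
    col_totals = conf_mat.sum(axis=0)
    return np.divide(np.diag(conf_mat), col_totals,
                     out=np.zeros_like(col_totals, dtype=float),
                     where=col_totals > 0)

def ewcc_combine(posteriors, weights):
    """Error-weighted combination of per-classifier class posteriors.

    posteriors: list of length-K arrays, one per feature-pair classifier,
                each giving P(class | features) over K emotion classes.
    weights:    list of length-K arrays from confusion_weights(), one per
                classifier, scaling how much each classifier's vote counts
                for each class.
    """
    combined = np.zeros_like(posteriors[0])
    for p, w in zip(posteriors, weights):
        combined += w * p              # down-weight classes this classifier confuses
    return combined / combined.sum()   # renormalize to a distribution

# Hypothetical usage: three acoustic-prosodic feature-pair classifiers and
# four emotion classes (e.g. neutral, happy, angry, sad).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf_mats = [rng.integers(0, 50, size=(4, 4)) + 50 * np.eye(4, dtype=int)
                 for _ in range(3)]
    weights = [confusion_weights(m) for m in conf_mats]
    posteriors = [np.array([0.1, 0.6, 0.2, 0.1]),
                  np.array([0.2, 0.5, 0.2, 0.1]),
                  np.array([0.4, 0.3, 0.2, 0.1])]
    fused = ewcc_combine(posteriors, weights)
    print("fused posterior:", fused, "-> class", int(fused.argmax()))
```

Under this reading, a classifier that frequently mislabels a given emotion on held-out data contributes less to that emotion's fused score, which matches the abstract's stated goal of exploiting the differing contributions of the individual TCM-based CCM classifiers.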
