On Acoustic Emotion Recognition: Compensating for Covariate Shift

Pattern recognition tasks often face situations in which the training data are not fully representative of the test data. This problem is well recognized in speech recognition, where methods such as cepstral mean normalization (CMN), vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) are used to compensate for channel and speaker differences. Speech emotion recognition (SER), an important emerging field in human-computer interaction, faces the same data-shift problems, a fact that has been largely overlooked in this domain. In this paper, we show that compensating for channel and speaker differences can yield significant improvements in SER by modelling these differences as a covariate shift. We employ three algorithms from the transfer-learning literature that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift. We evaluate these methods on the FAU Aibo Emotion Corpus, which was used in the Interspeech 2009 Emotion Challenge and consists of two parts recorded independently at different schools; the two parts therefore exhibit covariate shift. Results show that the IW methods outperform combined CMN and VTLN and significantly improve on the baseline performance of the Challenge. The best of the three methods also improves significantly on the winning contribution to the Challenge.
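
The core compensation step described above, reweighting training utterances by an estimated density ratio w(x) = p_test(x)/p_train(x) before training the SVM, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the feature arrays and variable names are hypothetical stand-ins, and the logistic-regression density-ratio estimate is only a simple proxy for the dedicated importance-weight estimators the paper draws from the transfer-learning literature.

```python
# Sketch of importance-weighted SVM training under covariate shift.
# All data and names here are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for acoustic feature vectors from two recording conditions.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 10))   # "school A" features
y_train = rng.integers(0, 2, size=200)                      # emotion labels
X_test = rng.normal(loc=0.5, scale=1.2, size=(150, 10))     # "school B" features (shifted)

# 1) Estimate importance weights w(x) ~ p_test(x) / p_train(x) by training a
#    domain classifier and converting its probabilities into a density ratio.
X_dom = np.vstack([X_train, X_test])
d_dom = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
dom_clf = LogisticRegression(max_iter=1000).fit(X_dom, d_dom)
p_test_given_x = dom_clf.predict_proba(X_train)[:, 1]
weights = (p_test_given_x / (1.0 - p_test_given_x)) * (len(X_train) / len(X_test))

# 2) Train the emotion classifier with per-sample importance weights, so that
#    training examples resembling the test condition count more.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train, sample_weight=weights)

# 3) Classify test-condition data as usual.
pred = svm.predict(X_test)
```

Passing the weights via the SVM's per-sample weighting scales each training example's contribution to the loss, which is the general mechanism by which importance-weighted classifiers reduce the mismatch between the two recording sites.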
