Selecting Training Data for Cross-Corpus Speech Emotion Recognition: Prototypicality vs. Generalization

We investigate strategies for selecting databases and instances for training cross-corpus emotion recognition systems, that is, systems that generalize across different labelling concepts, languages, and interaction scenarios. We propose objective measures of prototypicality based on distances in a large space of brute-forced acoustic features and show how they relate to the expected performance in cross-corpus testing. We perform an extensive evaluation on eight commonly used corpora of emotional speech, ranging from acted to fully natural emotion and from limited phonetic content to conversational speech. As a result, selecting prototypical training instances by the proposed criterion can deliver a gain of up to 7.5 % unweighted accuracy in cross-corpus arousal recognition, and there is a correlation of .571 between the proposed prototypicality measure of databases and the expected unweighted accuracy in cross-corpus testing with Support Vector Machines.
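
The abstract does not spell out the distance-based measure itself. As one plausible reading (a minimal sketch, not the authors' published formula), prototypicality can be scored by an instance's proximity to its emotion-class centroid in a normalised acoustic feature space, keeping only the most prototypical fraction of each class for training. The function names and the centroid-distance criterion below are illustrative assumptions.

```python
import numpy as np

def prototypicality_scores(features, labels):
    """Score each instance by closeness to its class centroid.

    Assumption: the paper defines its own distance-based measure over
    brute-forced acoustic features; this centroid-distance variant is
    just one plausible instantiation for illustration.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # z-normalise each feature dimension so distances are comparable
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12
    z = (features - mu) / sigma
    scores = np.empty(len(z))
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        centroid = z[idx].mean(axis=0)
        dist = np.linalg.norm(z[idx] - centroid, axis=1)
        scores[idx] = -dist  # higher score = more prototypical
    return scores

def select_prototypical(features, labels, keep_fraction=0.5):
    """Keep the most prototypical fraction of instances per class."""
    labels = np.asarray(labels)
    scores = prototypicality_scores(features, labels)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        n_keep = max(1, int(round(keep_fraction * len(idx))))
        keep.extend(idx[np.argsort(-scores[idx])][:n_keep])
    return np.sort(np.array(keep))
```

In the cross-corpus setting described above, the retained subset would then serve as training material for a classifier such as a Support Vector Machine, evaluated on a held-out corpus with a different labelling concept, language, or interaction scenario.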
