Modeling subjectiveness in emotion recognition with deep neural networks: Ensembles vs soft labels

Ground-truth labels obtained by averaging or majority voting are commonly used to train automatic emotion classifiers. However, such labels fail to capture inter-annotator variability and ignore the subjectiveness of emotions. In this paper, we propose two viable approaches to modeling the subjectiveness of emotions by incorporating inter-annotator variability: soft labels and model ensembling, in which each model represents an individual annotator. Using a deep neural network that recognizes emotions in real time from one-second windows of speech spectrograms, we demonstrate that both approaches yield consistent improvements over training on ground-truth labels alone. We further show empirically that the performance gain of the ensemble over the baseline model can also be obtained using soft labels generated from multiple annotators.
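
To make the two approaches concrete, the sketch below shows, under stated assumptions, one way soft-label training and a per-annotator ensemble could be set up. It is not the authors' implementation: the four-class label set, the toy CNN, the number of annotators, and the optimizer settings are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch, not the authors' code: soft-label training and a
# per-annotator ensemble for speech emotion recognition. The 4-class label
# set, the toy CNN, the annotator count, and the hyperparameters are all
# illustrative assumptions, not details taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4  # assumed emotion set, e.g. angry / happy / neutral / sad


def soft_labels(annotations: torch.Tensor) -> torch.Tensor:
    """Turn per-annotator hard labels of shape (batch, n_annotators)
    into a per-utterance class distribution of shape (batch, NUM_CLASSES)."""
    one_hot = F.one_hot(annotations, NUM_CLASSES).float()  # (B, A, C)
    return one_hot.mean(dim=1)  # average over annotators -> soft label


class SpectrogramCNN(nn.Module):
    """Tiny stand-in for a real-time CNN over 1-second spectrogram windows."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, NUM_CLASSES)

    def forward(self, x):  # x: (B, 1, n_mels, n_frames)
        return self.classifier(self.features(x).flatten(1))  # logits (B, C)


def soft_target_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a soft target distribution."""
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()


# Toy soft-label training step on random data, just to show the shapes line up.
model = SpectrogramCNN()
opt = torch.optim.RMSprop(model.parameters(), lr=1e-4)

specs = torch.randn(8, 1, 64, 100)              # batch of spectrogram windows
annos = torch.randint(0, NUM_CLASSES, (8, 3))   # 3 annotators per utterance
loss = soft_target_loss(model(specs), soft_labels(annos))
opt.zero_grad()
loss.backward()
opt.step()

# Ensemble variant (sketch): one model per annotator, each trained on that
# annotator's hard labels; at test time their softmax outputs are averaged.
ensemble = [SpectrogramCNN() for _ in range(3)]
probs = torch.stack([F.softmax(m(specs), dim=-1) for m in ensemble]).mean(dim=0)
```

Averaging the annotators' one-hot labels yields a target distribution whose spread reflects their disagreement, so utterances with high inter-annotator variability contribute softer, lower-confidence targets; the ensemble encodes the same variability through disagreement among its per-annotator models.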
