Every Rating Matters: Joint Learning of Subjective Labels and Individual Annotators for Speech Emotion Classification

Emotion perception is subjective and varies across individuals due to natural human biases such as gender, culture, and age. Conventionally, emotion recognition relies on consensus, e.g., the majority vote over annotations (hard label) or the distribution of annotations (soft label), and does not include rater-specific modeling. In this paper, we propose a joint learning methodology that simultaneously considers label uncertainty and annotator idiosyncrasy by combining hard and soft emotion labels with individual and crowd annotator modeling. Our proposed model achieves an unweighted average recall (UAR) of 61.48% on the benchmark emotion corpus. Further analyses reveal that emotion perception is indeed rater-dependent, that the hard label and the soft emotion distribution provide complementary affect-modeling information, and that jointly learning subjective emotion perception and individual rater models yields the best discriminative power.
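To make the joint objective concrete, below is a minimal PyTorch sketch of one way such a model could be realized: a shared encoder feeds a consensus head, trained against both the hard (majority-vote) label and the soft (annotation-distribution) label, plus one lightweight head per annotator trained on that rater's own labels. The class `JointRaterModel`, the loss weights `alpha`/`beta`/`gamma`, and all layer sizes are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a joint hard/soft-label + per-annotator objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRaterModel(nn.Module):
    def __init__(self, feat_dim=384, hidden=256, n_classes=4, n_annotators=10):
        super().__init__()
        # shared encoder over acoustic features
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # consensus (crowd) head
        self.consensus_head = nn.Linear(hidden, n_classes)
        # one small head per individual annotator
        self.annotator_heads = nn.ModuleList(
            [nn.Linear(hidden, n_classes) for _ in range(n_annotators)]
        )

    def forward(self, x):
        h = self.encoder(x)
        return self.consensus_head(h), [head(h) for head in self.annotator_heads]

def joint_loss(consensus_logits, annotator_logits, hard_y, soft_y,
               annotator_y, annotator_mask, alpha=1.0, beta=1.0, gamma=0.5):
    """hard_y: majority-vote class indices, shape (B,)
       soft_y: normalized annotation distributions, shape (B, C)
       annotator_y: per-annotator class indices, shape (B, A)
       annotator_mask: 1 where annotator a rated utterance b, shape (B, A)"""
    # hard-label term: cross-entropy against the majority vote
    l_hard = F.cross_entropy(consensus_logits, hard_y)
    # soft-label term: KL divergence to the annotation distribution
    l_soft = F.kl_div(F.log_softmax(consensus_logits, dim=-1), soft_y,
                      reduction="batchmean")
    # individual-annotator terms, masked to the raters who labeled each clip
    l_ind = 0.0
    for a, logits in enumerate(annotator_logits):
        ce = F.cross_entropy(logits, annotator_y[:, a], reduction="none")
        l_ind = l_ind + (ce * annotator_mask[:, a]).sum() \
            / annotator_mask[:, a].sum().clamp(min=1)
    l_ind = l_ind / len(annotator_logits)
    return alpha * l_hard + beta * l_soft + gamma * l_ind
```

Weighting the per-annotator term separately (`gamma`) lets the shared encoder absorb rater-invariant structure while the small per-rater heads capture each annotator's idiosyncrasy, which matches the paper's framing of joint subjective-label and individual-annotator learning.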
