Meta-Learning for Speech Emotion Recognition Considering Ambiguity of Emotion Labels

Emotion labels in emotion recognition corpora are often noisy and ambiguous because annotators perceive emotions subjectively. This ambiguity can introduce errors into automatic classification and degrade overall performance. We therefore propose a model with dynamic label correction and sample contribution weight estimation. It extends a standard BLSTM model with attention by two additional sets of parameters: the first learns a corrected label distribution and aims to fix inaccurate labels in the dataset, while the second estimates each sample's contribution to the training process, down-weighting ambiguous and noisy samples and giving higher weights to clear ones. We train the model with an alternating optimization scheme: in one epoch we update the neural network parameters, and in the next we keep them fixed and update the label correction and sample importance parameters. Trained and evaluated on the IEMOCAP dataset, our model achieves a weighted accuracy (WA) of 65.9% and an unweighted accuracy (UA) of 61.4%, absolute improvements of 2.3% and 1.9%, respectively, over a BLSTM-with-attention baseline trained on the corpus gold labels.
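Below is a minimal PyTorch sketch of the setup described in the abstract. The abstract does not specify how the corrected label distribution or the sample weights are parameterized, so this uses one plausible choice: per-sample label logits passed through a softmax, and per-sample scores passed through a sigmoid. All names, dimensions, hyperparameters, and the toy data (BLSTMAttention, hidden size, learning rates, feature shapes) are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class BLSTMAttention(nn.Module):
    """Baseline: BLSTM encoder with additive attention pooling over time."""
    def __init__(self, n_feats, hidden, n_classes):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, n_feats)
        h, _ = self.blstm(x)                     # (batch, time, 2*hidden)
        a = F.softmax(self.attn(h), dim=1)       # attention weights over time
        return self.out((a * h).sum(dim=1))      # utterance-level class logits

# Toy stand-in data: 4 emotion classes, fixed-length acoustic feature sequences.
n_train, n_classes = 512, 4
X = torch.randn(n_train, 100, 40)                # placeholder feature sequences
y = torch.randint(0, n_classes, (n_train,))      # original (possibly noisy) labels
loader = DataLoader(TensorDataset(torch.arange(n_train), X),
                    batch_size=32, shuffle=True)

model = BLSTMAttention(n_feats=40, hidden=128, n_classes=n_classes)

# The two extra parameter sets: per-sample corrected-label logits (initialized
# near the original one-hot labels) and per-sample importance scores.
label_logits  = nn.Parameter(5.0 * F.one_hot(y, n_classes).float())
sample_scores = nn.Parameter(torch.zeros(n_train))

opt_net  = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_meta = torch.optim.Adam([label_logits, sample_scores], lr=1e-2)

def weighted_loss(idx, x):
    logits = model(x)
    y_corr = F.softmax(label_logits[idx], dim=-1)  # corrected soft labels
    w = torch.sigmoid(sample_scores[idx])          # per-sample contribution weight
    ce = -(y_corr * F.log_softmax(logits, dim=-1)).sum(dim=-1)
    # NOTE: in practice a constraint (e.g., normalizing w across the batch) is
    # needed to keep the weights from collapsing toward zero.
    return (w * ce).mean()

for epoch in range(20):
    for idx, x in loader:
        if epoch % 2 == 0:            # update the network; labels/weights fixed
            opt_net.zero_grad()
            weighted_loss(idx, x).backward()
            opt_net.step()
        else:                         # network fixed; update label-correction
            opt_meta.zero_grad()      # and sample-importance parameters
            weighted_loss(idx, x).backward()
            opt_meta.step()
```

The key design point is the two optimizers over disjoint parameter sets: alternating which one steps per epoch realizes the described schedule, while gradients flow through the same weighted loss in both phases. A faithful reproduction would also regularize the corrected labels toward the originals so they cannot drift arbitrarily.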
