Speech emotion recognition model based on Bi-GRU and Focal Loss

ABSTRACT To address the problems of inconsistent sample durations and imbalanced emotion categories in speech emotion corpora, this paper proposes a speech emotion recognition model based on Bi-GRU (Bidirectional Gated Recurrent Unit) and Focal Loss. The model builds on and improves the CRNN (Convolutional Recurrent Neural Network) architecture: within the CRNN, Bi-GRU is used to effectively model short-duration speech samples, and the Focal Loss function is used to handle the classification difficulty caused by the imbalance of emotion categories among the samples. In comparative experiments with different methods, weighted average recall (WAR), unweighted average recall (UAR), and the confusion matrix (CM) are used as evaluation metrics. The experimental results show that the proposed model improves recognition accuracy and alleviates the effect of class imbalance on the IEMOCAP database, and they demonstrate that the improvement in speech emotion recognition performance is not due to the adjustment of model parameters or a change in the model topology.
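For context, the standard focal loss of Lin et al. down-weights well-classified examples: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where gamma >= 0 is the focusing parameter and alpha_t an optional per-class weight. The sketch below is a minimal, hypothetical PyTorch illustration of the kind of model the abstract describes (a convolutional front end, a Bi-GRU over the time axis, and a multi-class focal loss); the layer sizes, gamma, class count, and spectrogram shape are assumptions, not the authors' configuration.

# Minimal sketch (not the authors' code): CRNN with a Bi-GRU layer and multi-class
# focal loss, assuming PyTorch and log-Mel spectrogram inputs of shape (B, 1, n_mels, T).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=None):
        super().__init__()
        self.gamma = gamma          # focusing parameter (illustrative value)
        self.alpha = alpha          # optional per-class weights, tensor of shape [C]

    def forward(self, logits, targets):
        log_p = F.log_softmax(logits, dim=-1)
        ce = F.nll_loss(log_p, targets, weight=self.alpha, reduction='none')
        p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
        return ((1.0 - p_t) ** self.gamma * ce).mean()

class CRNN_BiGRU(nn.Module):
    def __init__(self, n_mels=40, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=128,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):                        # x: (B, 1, n_mels, T)
        f = self.conv(x)                         # (B, 64, n_mels//4, T//4)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (B, T//4, 64 * n_mels//4)
        out, _ = self.gru(f)                     # Bi-GRU over the time axis
        return self.fc(out.mean(dim=1))          # temporal average pooling, then classify

# Illustrative usage with random data:
#   logits = CRNN_BiGRU()(torch.randn(8, 1, 40, 100))
#   loss = FocalLoss(gamma=2.0)(logits, torch.randint(0, 4, (8,)))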
