A proposal for emotion recognition using speech features, transfer learning and convolutional neural networks

In this paper, we present a proposal for emotion recognition from speech audio features that consists of two functionally independent systems. First, a voice activity detection (VAD) module acts as a filter prior to the emotion classification task: it extracts features from the input audio and uses an SVM classifier to predict the presence of voice activity. Second, the speech emotion classifier (EMO) transforms the power spectrum of the signal to the Mel scale and obtains a feature vector from it using a convolutional neural network. Emotion labels are then assigned from this vector by a KNN classifier. The models were trained on the RAVDESS dataset, obtaining a maximum accuracy of 93.57% when classifying 8 emotions.
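The EMO path described above (power spectrum, Mel scale, feature vector, KNN) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frame size, hop, number of Mel bands and value of k are assumed, the `embed` function is a hypothetical stand-in (time-pooled statistics) for the convolutional network, the synthetic tones merely replace labelled RAVDESS clips, and the VAD stage is omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(signal, sr, n_fft=512, hop=256, n_mels=40):
    """Frame the signal, take the power spectrum, map it to Mel bands, then take the log."""
    window = np.hanning(n_fft)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T     # (n_frames, n_mels)
    return np.log(mel + 1e-10)

def embed(log_mel):
    """Hypothetical placeholder for the CNN feature extractor: mean and std over time."""
    return np.concatenate([log_mel.mean(axis=0), log_mel.std(axis=0)])

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest in Euclidean distance."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: noisy tones stand in for labelled speech clips (assumption for illustration).
SR = 16000
rng = np.random.default_rng(0)

def make_tone(freq, duration=0.25):
    t = np.arange(int(SR * duration)) / SR
    return np.sin(2.0 * np.pi * freq * t) + 0.01 * rng.standard_normal(t.size)

train_freqs = [240.0, 250.0, 260.0, 2400.0, 2500.0, 2600.0]
train_y = np.array(["low", "low", "low", "high", "high", "high"])
train_x = np.stack([embed(log_mel_spectrogram(make_tone(f), SR)) for f in train_freqs])

predicted_low = knn_predict(train_x, train_y, embed(log_mel_spectrogram(make_tone(255.0), SR)))
predicted_high = knn_predict(train_x, train_y, embed(log_mel_spectrogram(make_tone(2550.0), SR)))
print(predicted_low, predicted_high)
```

In a setup closer to the one described, `embed` would be replaced by the activations of a convolutional network applied to the Mel spectrogram, and the training vectors would come from labelled utterances rather than tones.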
