Speech Emotion Recognition using Convolutional Recurrent Neural Networks and Spectrograms

In this study, a speech emotion recognition technique based on a deep neural network, trained on the King Saud University Emotions Arabic dataset, is presented. A convolutional neural network (CNN) and long short-term memory (LSTM) layers are combined to form the primary convolutional recurrent neural network (CRNN) system. The study further investigates the use of linearly spaced spectrograms as inputs to the emotional speech recognizer. The performance of the CRNN system is compared against a human perceptual evaluation in which listeners judged the emotion conveyed by each utterance; this evaluation serves as the baseline system. The CRNN system achieves accuracies of 84.55% at the file level and 77.51% at the segment level, values that are close to the human emotion perception scores.
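The input representation named above, a linearly spaced (linear-frequency) spectrogram, can be sketched with a short-time FFT. The following NumPy snippet is a minimal illustration only; the frame length, hop size, window, and sample rate are assumed values, not parameters reported by the paper:

```python
import numpy as np

def linear_spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram on a linear frequency scale via a short-time FFT.

    frame_len and hop are hypothetical settings (25 ms / 10 ms at 16 kHz).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 linearly spaced frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: 1 second of a 440 Hz tone sampled at 16 kHz (assumed rate)
sr = 16000
t = np.arange(sr) / sr
spec = linear_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201): (time frames, frequency bins)
```

In a CRNN pipeline of the kind the abstract describes, such time-frequency matrices would typically be split into fixed-length segments, passed through convolutional layers to extract local spectro-temporal features, and then through LSTM layers to model their temporal evolution.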
