Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network

Speech emotion recognition (SER) is a vital and challenging task in which feature extraction plays a significant role in overall performance. Motivated by advances in deep learning, we focus on end-to-end structures and validate an algorithm that proves highly effective. In this paper, we introduce a novel architecture, ADRNN (a dilated CNN with residual blocks and a BiLSTM with an attention mechanism), for speech emotion recognition; it combines the strengths of these diverse networks while overcoming the shortcomings of using any one of them alone, and is evaluated on the popular IEMOCAP database and the Berlin EMODB corpus. The dilated CNN gives the model larger receptive fields than pooling layers would, the skip connections preserve more information from the shallow layers, and the BiLSTM layers learn long-term dependencies from the learned local features. We further apply an attention mechanism to refine the extracted speech features. In addition, we improve the loss function by combining softmax cross-entropy with the center loss, which yields better classification performance. Since emotional utterances are transformed into spectrograms, we extract 3-D log-Mel spectrograms from the raw signals, feed them into the proposed network, and obtain notable performance: 74.96% unweighted accuracy in the speaker-dependent experiment and 69.32% in the speaker-independent experiment, surpassing the 64.74% of previous state-of-the-art methods on the spontaneous emotional speech of the IEMOCAP database. On Berlin EMODB, the proposed networks achieve recognition accuracies of 90.78% and 85.39% in the speaker-dependent and speaker-independent experiments respectively, which are better than the 88.30% and 82.82% obtained by previous work.
To validate robustness and generalization, we also conduct a cross-corpus experiment between the above databases and obtain a favorable final recognition accuracy of 63.84%.
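The 3-D log-Mel input can be sketched as stacking the static log-Mel spectrogram with its delta and delta-delta along a channel axis. The sketch below assumes the log-Mel spectrogram is already computed; the delta window width and the Mel-band count are assumptions, not values taken from the abstract:

```python
import numpy as np

def stack_3d_logmel(log_mel, width=2):
    """Stack static log-Mel with delta and delta-delta into a 3-channel input.

    log_mel: (T, F) log-Mel spectrogram (e.g. F Mel bands over T frames); how
             it is derived from the raw signal (window, hop, filter count) is
             an assumption outside this sketch.
    Returns: (T, F, 3) tensor ordered [static, delta, delta-delta].
    """
    def delta(feat, n=width):
        # Standard regression-based delta over a +/- n frame window,
        # with edge padding at the sequence boundaries.
        padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
        denom = 2 * sum(i * i for i in range(1, n + 1))
        return sum(
            i * (padded[n + i:len(feat) + n + i] - padded[n - i:len(feat) + n - i])
            for i in range(1, n + 1)
        ) / denom

    d1 = delta(log_mel)          # first-order dynamics
    d2 = delta(d1)               # second-order dynamics
    return np.stack([log_mel, d1, d2], axis=-1)
```

The three channels play the same role as RGB channels in image models, letting the dilated-CNN front end see static spectral shape and its temporal dynamics jointly.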
