Parallelized Convolutional Recurrent Neural Network With Spectral Features for Speech Emotion Recognition

Speech is the most effective way for people to exchange complex information. Recognizing the emotional information contained in speech is one of the important challenges in the field of artificial intelligence. To better capture emotional features in speech signals, a parallelized convolutional recurrent neural network (PCRN) with spectral features is proposed for speech emotion recognition. First, frame-level features are extracted from each utterance, and a long short-term memory (LSTM) network is employed to learn these features frame by frame. At the same time, the deltas and delta-deltas of the log Mel-spectrogram are calculated and stacked into three channels (static, delta, and delta-delta); these 3-D features are learned by a convolutional neural network (CNN). The two learned high-level representations are then fused and batch normalized. Finally, a softmax classifier is used to classify emotions. Our PCRN model processes the two different types of features in parallel to better learn the subtle changes in emotion. Experimental results on four public datasets show the superiority of the proposed method, which outperforms previous works.
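
Below is a minimal sketch of the pipeline described above, assuming librosa for the log Mel-spectrogram and its deltas and PyTorch for the parallel LSTM/CNN branches. The feature dimensions, layer widths, and CNN depth (e.g. 36 frame-level features, 64 Mel bands, 128 LSTM units) are illustrative assumptions, not the paper's exact configuration.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn


def three_channel_logmel(path, sr=16000, n_mels=64):
    """Static log Mel-spectrogram plus its deltas and delta-deltas,
    stacked as three channels (static, delta, delta-delta)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    delta = librosa.feature.delta(log_mel, order=1)
    delta2 = librosa.feature.delta(log_mel, order=2)
    return np.stack([log_mel, delta, delta2], axis=0)  # (3, n_mels, T)


class PCRN(nn.Module):
    """Sketch of a parallelized convolutional recurrent network:
    an LSTM branch for frame-level features and a CNN branch for the
    3-channel log Mel-spectrogram, fused and batch normalized."""

    def __init__(self, n_frame_feats=36, n_classes=4):
        super().__init__()
        # Recurrent branch: learns frame-level features frame by frame.
        self.lstm = nn.LSTM(n_frame_feats, 128, batch_first=True)
        # Convolutional branch: learns the 3-channel spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # -> (B, 64, 1, 1)
        )
        # Fuse the two high-level representations, then batch-normalize.
        self.bn = nn.BatchNorm1d(128 + 64)
        self.fc = nn.Linear(128 + 64, n_classes)

    def forward(self, frame_feats, spec3):
        # frame_feats: (B, T, n_frame_feats); spec3: (B, 3, n_mels, T')
        _, (h_n, _) = self.lstm(frame_feats)
        rnn_out = h_n[-1]                       # (B, 128) last hidden state
        cnn_out = self.cnn(spec3).flatten(1)    # (B, 64)
        fused = self.bn(torch.cat([rnn_out, cnn_out], dim=1))
        return self.fc(fused)                   # class logits


# Example with random tensors in place of real features:
model = PCRN()
frame_feats = torch.randn(8, 300, 36)   # 8 utterances, 300 frames, 36 features
spec3 = torch.randn(8, 3, 64, 300)      # 3-channel log-Mel inputs
logits = model(frame_feats, spec3)
```

Training with nn.CrossEntropyLoss applies the softmax implicitly; at inference, a softmax over the logits yields the emotion posteriors.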
