End-to-End Speech Emotion Recognition Based on One-Dimensional Convolutional Neural Network

Real-time speech emotion recognition remains a challenging problem. To address it, we propose an end-to-end speech emotion recognition model based on a one-dimensional convolutional neural network, containing only three convolutional layers, two pooling layers, and one fully connected layer. Trained with the Adam optimizer and back-propagation, the network progressively extracts increasingly discriminative features. The model is structurally simple and performs the emotion classification task quickly. Unlike traditional methods, it does not require a complex hand-crafted feature extraction pipeline; instead, it learns emotional features automatically from the raw speech signal. In emotion recognition experiments on four speech databases (EMODB, CASIA, IEMOCAP, and CHEAVD), relatively high recognition rates were obtained. The experiments show that the proposed algorithm is well suited to real-time speech emotion recognition.
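A minimal sketch (in PyTorch) of a 1D-CNN emotion classifier of the kind described above: three convolutional layers, two pooling layers, and one fully connected layer, trained end-to-end on raw waveforms with Adam. All kernel widths, channel counts, the global average pooling before the classifier, and the 4-class output are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class SpeechEmotion1DCNN(nn.Module):
    """Hypothetical 1D-CNN over raw speech: 3 conv layers, 2 pooling layers, 1 FC layer."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(),   # conv layer 1
            nn.MaxPool1d(4),                                         # pooling layer 1
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),  # conv layer 2
            nn.MaxPool1d(4),                                         # pooling layer 2
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),  # conv layer 3
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis so variable-length input works (illustrative choice)
        )
        self.classifier = nn.Linear(64, num_classes)                 # single fully connected layer

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) raw speech signal
        x = self.features(waveform)
        return self.classifier(x.flatten(1))

model = SpeechEmotion1DCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch of 1-second, 16 kHz clips (sampling rate is an assumption).
batch = torch.randn(8, 1, 16000)
labels = torch.randint(0, 4, (8,))
loss = criterion(model(batch), labels)
optimizer.zero_grad()
loss.backward()   # back-propagation updates the learned features
optimizer.step()
```

Because the convolutions operate directly on the waveform, no hand-crafted acoustic features are needed; the feature extractor and classifier are optimized jointly by Adam.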
