Speech Emotion Recognition with Hybrid Neural Network

With the rapid development of deep learning, great progress has been made in many areas: Convolutional Neural Networks (CNNs) have achieved unprecedented success in computer vision, while Recurrent Neural Networks (RNNs) and the attention mechanism work well for time-series tasks. This paper proposes a speech emotion recognition (SER) model that combines a CNN, Long Short-Term Memory (LSTM), and an attention mechanism, without using any traditional hand-crafted features. In addition, a new flipping method is proposed for data augmentation to expand the data set. Applying the proposed model and the new augmentation method to an emotional speech database yields improved classification accuracy.
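To make the hybrid architecture concrete, the following is a minimal sketch in PyTorch of a CNN + LSTM + attention classifier operating on spectrogram input, together with a flip-based augmentation. The layer sizes, the log-mel front end, the attention-pooling form, and the choice of flipping along the time axis are illustrative assumptions, not the exact configuration described in the paper.

```python
# Minimal sketch of a CNN + LSTM + attention SER classifier (PyTorch).
# Layer sizes, the spectrogram front end, and the time-axis "flip"
# augmentation are illustrative assumptions, not the authors' setup.
import torch
import torch.nn as nn

class CNNLSTMAttention(nn.Module):
    def __init__(self, n_mels=64, n_classes=7, hidden=128):
        super().__init__()
        # CNN front end: learns local time-frequency patterns from the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # LSTM models the temporal evolution of the CNN feature maps.
        self.lstm = nn.LSTM(64 * (n_mels // 4), hidden,
                            batch_first=True, bidirectional=True)
        # Attention pooling: a learned weighted average over time steps.
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                      # spec: (batch, 1, n_mels, time)
        h = self.cnn(spec)                        # (batch, 64, n_mels/4, time/4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, features)
        h, _ = self.lstm(h)                       # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)    # attention weights over time
        ctx = (w * h).sum(dim=1)                  # attention-pooled utterance vector
        return self.fc(ctx)                       # emotion logits

def flip_augment(spec):
    # Hypothetical "flipping" augmentation: reverse the spectrogram along
    # the time axis to generate an additional training example.
    return torch.flip(spec, dims=[-1])

if __name__ == "__main__":
    model = CNNLSTMAttention()
    x = torch.randn(2, 1, 64, 200)                # two fake log-mel spectrograms
    print(model(x).shape)                         # torch.Size([2, 7])
    print(flip_augment(x).shape)                  # same shape as the input
```

The attention layer here is a simple softmax pooling over LSTM time steps, chosen for brevity; the actual weighting scheme in the paper may differ.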
