Speech emotion recognition via ensembling neural networks

Speech emotion recognition (SER) methods based on deep neural networks (DNNs) have demonstrated competitive performance compared with traditional SER approaches. However, the literature shows that the confusion matrices of different SER methods vary considerably, which indicates that different DNN architectures differ in their ability to model the emotional cues in speech. It also means that a single classifier rarely performs well on all speech emotion categories, possibly because of data imbalance and the limitations of the classifier itself. Motivated by the improvements reported for ensemble learning, this paper investigates an ensemble method for SER that aggregates the results of several base classifiers. Considering the outstanding performance of recurrent neural networks (RNNs) in various speech tasks and of residual networks (ResNets) in image classification, we chose an RNN and a ResNet as base classifiers. Experiments show that the proposed ensemble SER system outperforms the state-of-the-art single-classifier-based SER system.
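The aggregation step described above can be illustrated with a minimal sketch. The abstract does not state which aggregation rule is used, so this example assumes soft voting (a weighted average of class posteriors). The label set, the function names `soft_vote` and `predict`, and the probability values are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence

# Hypothetical emotion label set; the abstract does not list the categories.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_vote(
    posteriors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Fuse per-classifier class posteriors by a weighted average.

    Assumption: soft voting stands in for the paper's unspecified
    aggregation rule.
    """
    n_models = len(posteriors)
    if n_models == 0:
        raise ValueError("need at least one base classifier output")
    n_classes = len(posteriors[0])
    if any(len(p) != n_classes for p in posteriors):
        raise ValueError("all classifiers must score the same classes")
    if weights is None:
        weights = [1.0] * n_models  # equal trust in every base classifier
    if len(weights) != n_models:
        raise ValueError("one weight per base classifier is required")
    total = float(sum(weights))
    return [
        sum(w * p[c] for w, p in zip(weights, posteriors)) / total
        for c in range(n_classes)
    ]


def predict(
    posteriors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> str:
    """Return the emotion label with the highest fused posterior."""
    fused = soft_vote(posteriors, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


# Made-up posteriors for one utterance from two base classifiers.
rnn_probs = [0.10, 0.20, 0.30, 0.40]     # RNN alone would pick "sad"
resnet_probs = [0.05, 0.60, 0.25, 0.10]  # ResNet alone would pick "happy"

label = predict([rnn_probs, resnet_probs])
```

With equal weights the fused posterior favors the class on which the more confident classifier agrees most strongly. Passing `weights` (for example, per-classifier validation accuracy) is one possible way to favor the stronger base model, though whether the paper does this is not stated.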
