Acoustic emotion recognition using deep neural networks

Traditionally, acoustic emotion recognition systems have used Gaussian Mixture Models (GMMs) for classification. However, GMMs make poor use of multiple frames of input data and cannot efficiently exploit high-dimensional dependencies among features, so it is hard to push recognition accuracy further. Deep neural networks (DNNs) are artificial neural networks with more than one hidden layer; they are first pretrained layer by layer and then fine-tuned with the backpropagation algorithm. A well-trained DNN can model complex, non-linear structure in the input data and better predict the probability distribution over classification labels. In this paper, we used DNNs to replace the GMMs in the recognition system architecture and conducted a series of deep-learning experiments. Six discrete emotional states were classified with each of the two kinds of classifiers. Our work focused on the performance of the DNNs, and the experiments showed that the best recognition rate achieved by the DNN-based system was 8.2 percentage points higher than that of the GMM baseline.
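The layer-by-layer pretraining recipe described above can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' implementation: the layer sizes, learning rate, and CD-1 training schedule are hypothetical, and real-valued acoustic features would normally call for a Gaussian-Bernoulli RBM rather than the plain Bernoulli version used here for brevity (features are assumed scaled to [0, 1]).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Restricted Boltzmann Machine trained with one-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.rng = rng
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def train_batch(self, v0):
        # Positive phase: hidden probabilities given the data.
        h0 = sigmoid(v0 @ self.W + self.b_h)
        # One Gibbs step: sample hidden units, reconstruct visibles, recompute hiddens.
        h_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ self.W.T + self.b_v)
        h1 = sigmoid(v1 @ self.W + self.b_h)
        # CD-1 update: data statistics minus reconstruction statistics.
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

    def transform(self, v):
        return sigmoid(v @ self.W + self.b_h)

def pretrain_stack(X, layer_sizes, epochs=10, batch=128):
    """Greedy layer-wise pretraining: each RBM models the previous layer's output."""
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = RBM(data.shape[1], n_hidden)
        for _ in range(epochs):
            for i in range(0, len(data), batch):
                rbm.train_batch(data[i:i + batch])
        data = rbm.transform(data)   # hidden activations feed the next RBM
        rbms.append(rbm)
    return rbms

# Example with hypothetical frame-level features (e.g. 39-dim MFCC-like vectors in [0, 1]).
X = np.random.default_rng(1).random((1000, 39))
stack = pretrain_stack(X, layer_sizes=[256, 256, 128], epochs=5)

After pretraining, the RBM weights would initialize a feed-forward network topped with a six-way softmax over the emotion classes, which is then fine-tuned end to end with backpropagation; that supervised stage is omitted here.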
