Acoustic emotion recognition based on fusion of multiple feature-dependent deep Boltzmann machines

In this paper, we present a method for improving the classification recall of a deep Boltzmann machine (DBM) on the task of emotion recognition from speech. The task involves binary classification along four emotion dimensions: arousal, expectancy, power, and valence. The method divides the features of the input data into separate sets and trains a DBM on each set individually. The resulting per-set scores are then combined by simple fusion. The final fused scores are compared against scores obtained from support vector machine (SVM) classifiers and from the same DBM algorithm trained on the full feature set. The results show that the proposed method improves classification performance on all four dimensions and is well suited to unbalanced data sets.
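The pipeline below is a minimal sketch of the feature-split-and-fuse idea, under several stated assumptions: scikit-learn provides no deep Boltzmann machine, so a stack of BernoulliRBM layers feeding a logistic regression stands in for the paper's per-set DBM classifiers; the feature-set boundaries and all hyperparameters are illustrative; and "simple fusion" is interpreted here as averaging the per-set posterior scores, which the abstract does not specify.

```python
# Hedged sketch: one model per feature set, fused at the score level.
# Stacked BernoulliRBMs + logistic regression stand in for the paper's
# per-set DBMs; feature-set boundaries and hyperparameters are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.random((200, 30))      # toy acoustic features, scaled to [0, 1]
y = rng.integers(0, 2, 200)    # binary labels for one emotion dimension

# Split the full feature vector into disjoint sets (boundaries assumed).
feature_sets = [slice(0, 10), slice(10, 20), slice(20, 30)]

models = []
for cols in feature_sets:
    model = Pipeline([
        ("rbm1", BernoulliRBM(n_components=16, n_iter=20, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=8, n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X[:, cols], y)  # train each feature set individually
    models.append(model)

# Simple fusion (assumed to mean score averaging), then a 0.5 threshold.
scores = np.mean(
    [m.predict_proba(X[:, c])[:, 1] for m, c in zip(models, feature_sets)],
    axis=0,
)
y_pred = (scores >= 0.5).astype(int)
```

In this reading, each emotion dimension would get its own fused binary classifier of this shape; a weighted average or a learned combiner could replace the plain mean without changing the overall structure.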
