Emotion recognition from spontaneous speech using Hidden Markov models with deep belief networks

Research in emotion recognition seeks to develop insights into the temporal properties of emotion. However, automatic emotion recognition from spontaneous speech is challenging due to non-ideal recording conditions and highly ambiguous ground truth labels. Further, emotion recognition systems typically work with noisy, high-dimensional data, rendering it difficult to find representative features and train an effective classifier. We tackle this problem by using Deep Belief Networks, which can model complex and non-linear high-level relationships between low-level features. We propose and evaluate a suite of hybrid classifiers based on Hidden Markov Models and Deep Belief Networks. We achieve state-of-the-art results on FAU Aibo, a benchmark dataset in emotion recognition [1]. Our work provides insights into important similarities and differences between speech and emotion.
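A minimal sketch of the hybrid HMM/DBN idea described above, assuming the standard recipe of greedy RBM pre-training followed by using per-frame posteriors as scaled HMM emission likelihoods; the layer sizes, class counts, and hyper-parameters below are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch (not the authors' implementation): a DBN is pre-trained as a
# stack of RBMs on low-level acoustic features, and its per-frame emotion posteriors
# are converted to scaled likelihoods that an HMM decodes over time.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli RBM trained with 1-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        # Positive phase: hidden activations given the data.
        h0_p = self.hidden_probs(v0)
        h0 = (rng.random(h0_p.shape) < h0_p).astype(float)
        # Negative phase: one step of Gibbs sampling (reconstruction).
        v1_p = self.visible_probs(h0)
        h1_p = self.hidden_probs(v1_p)
        # Approximate log-likelihood gradient.
        self.W += self.lr * (v0.T @ h0_p - v1_p.T @ h1_p) / len(v0)
        self.b_v += self.lr * (v0 - v1_p).mean(axis=0)
        self.b_h += self.lr * (h0_p - h1_p).mean(axis=0)

def pretrain_dbn(data, layer_sizes, epochs=5):
    """Greedy layer-wise pre-training: each RBM models the previous layer's activations."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(x)
        rbms.append(rbm)
        x = rbm.hidden_probs(x)  # propagate features upward
    return rbms

def viterbi(log_emissions, log_trans, log_prior):
    """Standard Viterbi decode over per-frame scaled log-likelihoods."""
    T, K = log_emissions.shape
    delta = log_prior + log_emissions[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage with random "features"; a real system would use acoustic frames.
n_frames, n_feats, n_classes = 40, 30, 4            # assumed sizes, for illustration only
frames = (rng.random((n_frames, n_feats)) > 0.5).astype(float)
dbn = pretrain_dbn(frames, layer_sizes=[50, 50])     # supervised fine-tuning would follow

# Stand-in for fine-tuned DBN posteriors P(class | frame); here random but normalized.
posteriors = rng.dirichlet(np.ones(n_classes), size=n_frames)
priors = np.full(n_classes, 1.0 / n_classes)
scaled_loglik = np.log(posteriors) - np.log(priors)  # hybrid trick: posterior / prior

log_trans = np.log(np.full((n_classes, n_classes), 0.1) + np.eye(n_classes) * 0.6)
print(viterbi(scaled_loglik, log_trans, np.log(priors)))
```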

[1] Shrikanth S. Narayanan et al., "A hierarchical static-dynamic framework for emotion classification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[2] Tara N. Sainath et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," IEEE Signal Processing Magazine, 2012.

[3] Stefan Steidl et al., "Automatic classification of emotion related user states in spontaneous children's speech," 2009.

[4] Steve Renals et al., INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, 2010.

[5] Robert I. Damper et al., "On Acoustic Emotion Recognition: Compensating for Covariate Shift," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[6] Tara N. Sainath et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition," 2012.

[7] Nelson Morgan et al., "Deep and Wide: Multiple Layers in Automatic Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[8] Chris Eliasmith et al., "Deep networks for robust visual recognition," ICML, 2010.

[9] Rui Xia et al., "Sentence level emotion recognition based on decisions from subsentence segments," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[10] Björn W. Schuller et al., "Likability Classification - A Not so Deep Neural Network Approach," INTERSPEECH, 2012.

[11] Honglak Lee et al., "Deep learning for robust feature generation in audiovisual emotion recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[12] Hynek Hermansky et al., "Sparse Multilayer Perceptron for Phoneme Recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[13] Björn W. Schuller et al., "Deep neural networks for acoustic emotion recognition: Raising the benchmarks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[14] Nitesh V. Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, 2002.

[15] Geoffrey E. Hinton, "A Practical Guide to Training Restricted Boltzmann Machines," Neural Networks: Tricks of the Trade, 2012.

[16] Honglak Lee et al., "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," ICML, 2009.

[17] Douglas A. Reynolds et al., "An overview of automatic speaker recognition technology," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.

[18] D. Opitz et al., "Popular Ensemble Methods: An Empirical Study," Journal of Artificial Intelligence Research, 1999.

[19] Geoffrey E. Hinton et al., "Acoustic Modeling Using Deep Belief Networks," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[20] Douglas D. O'Shaughnessy et al., "Multiple windowed spectral features for emotion recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[21] Yee Whye Teh et al., "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, 2006.

[22] Björn W. Schuller et al., "The INTERSPEECH 2009 emotion challenge," INTERSPEECH, 2009.

[23] Honglak Lee et al., "Sparse deep belief net model for visual area V2," NIPS, 2007.

[24] Ragini Verma et al., "Combining Ranking and Classification to Improve Emotion Recognition in Spontaneous Speech," INTERSPEECH, 2012.

[25] Björn W. Schuller et al., "Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge," Speech Communication, 2011.

[26] R. Togneri et al., "An Overview of Speaker Identification: Accuracy and Robustness Issues," IEEE Circuits and Systems Magazine, 2011.

[27] Youngmoo E. Kim et al., "Learning emotion-based acoustic features with deep belief networks," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011.

[28] Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation, 2002.

[29] Mike E. Davies et al., IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.

[30] Haizhou Li et al., "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, 2010.

[31] Elmar Nöth et al., "The INTERSPEECH 2012 Speaker Trait Challenge," INTERSPEECH, 2012.

[32] Lukás Burget et al., "Brno University of Technology system for Interspeech 2009 emotion challenge," INTERSPEECH, 2009.

[33] Björn W. Schuller et al., "Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm," INTERSPEECH, 2010.

[34] Carlos Busso et al., "Using neutral speech models for emotional speech analysis," INTERSPEECH, 2007.