Segment-based speech emotion recognition using recurrent neural networks

Recently, Recurrent Neural Networks (RNNs) have produced state-of-the-art results for Speech Emotion Recognition (SER). Choosing the appropriate time-scale for Low-Level Descriptors (LLDs, local features) and statistical functionals (global features) is key to a high-performing SER system. In this paper, we investigate both local and global features and evaluate performance at various time-scales: frame, phoneme, word, and utterance. We show that for RNN models, extracting statistical functionals over speech segments roughly the duration of a couple of words yields optimal accuracy. We report state-of-the-art SER performance on the IEMOCAP corpus at significantly lower model and computational complexity.
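The segment-based pipeline described above, pooling frame-level LLDs into statistical functionals over word-scale segments before feeding the resulting short sequence to an RNN, can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the function name, the choice of functionals (mean, std, min, max), and the segment length are assumptions for the example.

```python
import numpy as np

def segment_functionals(lld, segment_len, hop=None):
    """Pool frame-level LLDs into segment-level statistical functionals.

    lld:         (n_frames, n_features) array of Low-Level Descriptors
    segment_len: frames per segment (chosen here to roughly span a
                 couple of words of speech)
    hop:         stride between segments in frames (defaults to
                 non-overlapping segments)

    Returns a (n_segments, 4 * n_features) array: per-segment mean,
    standard deviation, minimum, and maximum of each LLD.
    """
    hop = hop or segment_len
    segments = []
    for start in range(0, len(lld) - segment_len + 1, hop):
        win = lld[start:start + segment_len]
        segments.append(np.concatenate([
            win.mean(axis=0), win.std(axis=0),
            win.min(axis=0),  win.max(axis=0),
        ]))
    return np.asarray(segments)

# Example: 300 frames (~3 s at a 10 ms frame shift) of 13 MFCCs,
# pooled into 100-frame (~1 s) segments. The RNN then sees a sequence
# of 3 segment vectors instead of 300 frame vectors.
feats = segment_functionals(np.random.randn(300, 13), segment_len=100)
print(feats.shape)  # (3, 52)
```

Pooling to the segment level shortens the sequence the RNN must unroll over, which is one way such a model can match frame-level systems at lower computational cost.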
