Segment-based speech emotion recognition using recurrent neural networks

Recently, Recurrent Neural Networks (RNNs) have produced state-of-the-art results for Speech Emotion Recognition (SER). Choosing the appropriate time-scale for Low-Level Descriptors (LLDs, local features) and statistical functionals (global features) is key to a high-performing SER system. In this paper, we investigate both local and global features and evaluate performance at various time-scales: frame, phoneme, word, and utterance. We show that for RNN models, extracting statistical functionals over speech segments roughly the duration of a couple of words yields optimal accuracy. We report state-of-the-art SER performance on the IEMOCAP corpus at significantly lower model and computational complexity.
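The segment-based pipeline described above, pooling frame-level LLDs into statistical functionals over word-scale segments before feeding the resulting short sequence to an RNN, can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the function name, the choice of functionals (mean, std, min, max), and the segment length are assumptions for the example.

```python
import numpy as np

def segment_functionals(lld, segment_len, hop=None):
    """Pool frame-level LLDs into segment-level statistical functionals.

    lld:         (n_frames, n_features) array of Low-Level Descriptors
    segment_len: frames per segment (chosen here to roughly span a
                 couple of words of speech)
    hop:         stride between segments in frames (defaults to
                 non-overlapping segments)

    Returns a (n_segments, 4 * n_features) array: per-segment mean,
    standard deviation, minimum, and maximum of each LLD.
    """
    hop = hop or segment_len
    segments = []
    for start in range(0, len(lld) - segment_len + 1, hop):
        win = lld[start:start + segment_len]
        segments.append(np.concatenate([
            win.mean(axis=0), win.std(axis=0),
            win.min(axis=0),  win.max(axis=0),
        ]))
    return np.asarray(segments)

# Example: 300 frames (~3 s at a 10 ms frame shift) of 13 MFCCs,
# pooled into 100-frame (~1 s) segments. The RNN then sees a sequence
# of 3 segment vectors instead of 300 frame vectors.
feats = segment_functionals(np.random.randn(300, 13), segment_len=100)
print(feats.shape)  # (3, 52)
```

Pooling to the segment level shortens the sequence the RNN must unroll over, which is one way such a model can match frame-level systems at lower computational cost.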
