Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network

Estimating continuous emotional states from speech as a function of time has traditionally been framed as a regression problem. In this paper, we present a novel approach that moves the problem into the classification domain by discretizing the training labels at different resolutions. We employ a multi-task deep bidirectional long short-term memory (BLSTM) recurrent neural network (RNN), trained with a cost-sensitive cross-entropy loss, to model these labels jointly. We introduce an emotion decoding algorithm that incorporates long- and short-term temporal properties of the signal to produce more robust time-series estimates. We show that our proposed approach achieves competitive audio-only performance on the RECOLA dataset relative to previously published works as well as other strong regression baselines. This work provides a link between regression and classification, and contributes an alternative approach to continuous emotion recognition.
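
To make the label-discretization idea concrete, below is a minimal sketch of one plausible scheme: continuous ratings in [-1, 1] are mapped to class indices with equal-width bins at several resolutions, and inverse-frequency weights supply the cost-sensitive term of the cross-entropy loss. The bin counts (2, 4, 8), the equal-width binning, and the inverse-frequency weighting are assumptions for illustration only; the paper's exact scheme may differ.

```python
import numpy as np

def discretize_labels(labels, n_bins):
    """Map continuous ratings in [-1, 1] to class indices {0, ..., n_bins - 1}.

    Equal-width binning is an assumption for illustration; the paper's
    exact discretization scheme may differ.
    """
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    # np.digitize returns 1-based bin indices for in-range values; shift to
    # 0-based and clip so a rating of exactly +1 falls into the last bin.
    return np.clip(np.digitize(labels, edges) - 1, 0, n_bins - 1)

def class_weights(targets, n_bins):
    """Inverse-frequency weights for a cost-sensitive cross-entropy loss
    (an assumed weighting; the paper's cost scheme may differ)."""
    counts = np.bincount(targets, minlength=n_bins).astype(float)
    weights = counts.sum() / np.maximum(counts, 1.0)
    return weights / weights.mean()  # normalize so the average weight is 1

# One classification target per resolution, modeled jointly as multi-task outputs.
arousal = np.array([-0.9, -0.1, 0.3, 0.95])  # hypothetical frame-level ratings
targets = {k: discretize_labels(arousal, k) for k in (2, 4, 8)}
weights = {k: class_weights(t, k) for k, t in targets.items()}
```

In a multi-task setup of this kind, each per-resolution target would drive its own softmax output head on top of the shared BLSTM, which is a common arrangement for jointly modeling related classification tasks.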
