EmoRL: Continuous Acoustic Emotion Classification Using Deep Reinforcement Learning

Acoustically expressed emotions can make communication with a robot more efficient. Detecting emotions such as anger could give the robot a cue that a situation is unsafe or undesired. Recently, several deep neural network-based models have been proposed that establish new state-of-the-art results in affective state evaluation. These models typically start processing only at the end of each utterance, which not only requires a mechanism to detect the end of an utterance but also makes them difficult to use in a real-time communication scenario such as human-robot interaction. We propose EmoRL, a model that triggers an emotion classification as soon as it has gained enough confidence while listening to a person speaking. As a result, we minimize the need for segmenting the audio signal for classification and achieve lower latency because the audio signal is processed incrementally. The method achieves accuracy competitive with a strong baseline model while allowing much earlier prediction.
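To make the idea of confidence-triggered, incremental classification concrete, the sketch below shows one possible realization: an encoder consumes acoustic frames one at a time, and a small policy head decides at each step whether to keep listening or to classify now. This is a minimal illustration under assumed design choices (GRU encoder, a binary {wait, classify} action space, and the layer sizes shown), not the authors' exact EmoRL architecture.

```python
import torch
import torch.nn as nn


class IncrementalEmotionClassifier(nn.Module):
    """Sketch of an incremental emotion classifier with a learned
    termination policy. Hypothetical layer sizes and action space;
    not the published EmoRL configuration."""

    def __init__(self, n_features=40, hidden=128, n_emotions=4):
        super().__init__()
        self.rnn = nn.GRUCell(n_features, hidden)        # frame-by-frame encoder
        self.classifier = nn.Linear(hidden, n_emotions)  # emotion prediction head
        self.terminate = nn.Linear(hidden, 2)            # policy head: 0 = wait, 1 = classify

    def forward(self, frames):
        """frames: tensor of shape (T, n_features), processed one frame at a time."""
        h = frames.new_zeros(1, self.rnn.hidden_size)
        log_probs = []  # kept for policy-gradient (REINFORCE-style) training
        for t, x in enumerate(frames):
            h = self.rnn(x.unsqueeze(0), h)
            dist = torch.distributions.Categorical(logits=self.terminate(h))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            if action.item() == 1:  # the agent is confident enough to classify now
                return self.classifier(h), t + 1, log_probs
        # fall back to classifying at the end of the utterance
        return self.classifier(h), len(frames), log_probs
```

In such a setup, a reward that trades off classification correctness against the number of frames consumed could be optimized with a policy-gradient method such as REINFORCE, so that the policy learns to stop early only when waiting longer is unlikely to change the prediction.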
