Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network

The automatic recognition of spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be robust enough to capture the emotional content across various styles of speaking, while on the other, machine learning algorithms need to be insensitive to outliers while being able to model the context. Whereas the latter has been tackled by the use of Long Short-Term Memory (LSTM) networks, the former is still under very active investigation, even though more than a decade of research has provided a large set of acoustic descriptors. In this paper, we propose a solution to the problem of 'context-aware', emotion-relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time-domain representation. In this novel work on so-called end-to-end speech emotion recognition, we show that the proposed topology significantly outperforms traditional approaches based on signal processing techniques for the prediction of spontaneous and natural emotions on the RECOLA database.
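
To make the described topology concrete, the following is a minimal PyTorch sketch of a convolutional recurrent network operating on the raw waveform: a 1-D convolutional front-end learns frame-level features directly from the signal, and stacked LSTMs model their temporal context. All layer sizes, kernel widths, and the use of PyTorch are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class ConvRecurrentSER(nn.Module):
    """Minimal convolutional-recurrent sketch for end-to-end speech
    emotion recognition from raw audio. Layer sizes and kernel widths
    are hypothetical, chosen only to illustrate the idea."""

    def __init__(self, n_outputs: int = 2):  # e.g. arousal and valence
        super().__init__()
        # Temporal convolutions act as a learned front-end on the raw signal,
        # replacing hand-crafted acoustic descriptors.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 40, kernel_size=80, stride=4),   # ~5 ms kernels at 16 kHz
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(40, 40, kernel_size=40, stride=2),
            nn.ReLU(),
            nn.MaxPool1d(10),  # downsample to a frame-level rate
        )
        # Recurrent layers model the temporal context of the learned features.
        self.rnn = nn.LSTM(input_size=40, hidden_size=128,
                           num_layers=2, batch_first=True)
        self.head = nn.Linear(128, n_outputs)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, samples) raw audio
        x = self.frontend(wave.unsqueeze(1))   # -> (batch, channels, frames)
        x, _ = self.rnn(x.transpose(1, 2))     # -> (batch, frames, hidden)
        return self.head(x)                    # per-frame emotion estimates

model = ConvRecurrentSER()
dummy = torch.randn(4, 16000)   # 4 one-second clips at 16 kHz
print(model(dummy).shape)       # torch.Size([4, 97, 2])
```

Since the whole pipeline is differentiable, the network can be trained end-to-end with a single objective; for time-continuous emotion labels, a concordance-based loss over the per-frame outputs would be one natural choice.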
