Prediction-based learning for continuous emotion recognition in speech

In this paper, a prediction-based learning framework is proposed for the task of continuous emotion recognition from speech, which is one of the key components of affective computing in multimedia. The main goal of this framework is to fully exploit the individual advantages of different regression models in a cooperative manner. To this end, we take two widely used regression models as examples, i.e., support vector regression (SVR) and the bidirectional long short-term memory recurrent neural network (BLSTM-RNN). We concatenate the two models in a tandem structure in different ways, forming a unified cascaded framework. The outputs predicted by the former model are combined with the original features and used as the input to the following model for the final predictions. Experimental results on a time- and value-continuous spontaneous emotion database (RECOLA) show that the prediction-based learning framework significantly outperforms the individual models for both the arousal and valence dimensions, and yields significantly better results than other state-of-the-art methods on this corpus.

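To make the tandem structure concrete, the following is a minimal, self-contained sketch of the idea in Python: a first-stage support vector regressor is trained frame-wise, its predictions are appended to the original acoustic features, and the augmented feature sequence is fed to a bidirectional LSTM that produces the final continuous predictions. The library choices (scikit-learn's LinearSVR, PyTorch), the toy data, and all hyper-parameters are illustrative assumptions and not the configuration reported in the paper.

# Sketch of the prediction-based (tandem) framework described in the abstract.
# All dimensions, hyper-parameters, and the random toy data are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import LinearSVR

# --- Toy data: one recording of T frames with D-dimensional acoustic features ---
T, D = 500, 88                                           # assumed sequence length / feature size
rng = np.random.default_rng(0)
X = rng.standard_normal((T, D)).astype(np.float32)
y = rng.uniform(-1.0, 1.0, size=T).astype(np.float32)    # continuous arousal/valence labels in [-1, 1]

# --- Stage 1: frame-wise support vector regression ---
svr = LinearSVR(C=0.1, epsilon=0.1, max_iter=5000)
svr.fit(X, y)
y_svr = svr.predict(X).astype(np.float32)                 # first-stage predictions

# --- Stage 2: bidirectional LSTM fed with [original features ; SVR predictions] ---
X_aug = np.concatenate([X, y_svr[:, None]], axis=1)       # tandem input, D+1 dimensions

class BLSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size=64):
        super().__init__()
        self.blstm = nn.LSTM(input_size, hidden_size,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, 1)          # one continuous output per frame

    def forward(self, x):                                  # x: (batch, time, features)
        h, _ = self.blstm(x)
        return self.out(h).squeeze(-1)                     # (batch, time)

model = BLSTMRegressor(input_size=D + 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

inputs = torch.from_numpy(X_aug).unsqueeze(0)              # single-sequence "batch"
targets = torch.from_numpy(y).unsqueeze(0)

for epoch in range(20):                                    # short illustrative training loop
    optimizer.zero_grad()
    pred = model(inputs)
    loss = loss_fn(pred, targets)
    loss.backward()
    optimizer.step()

print("final training MSE:", float(loss))

The same pattern can be reversed (BLSTM first, SVR second) or repeated, which is what is meant by concatenating the models "in different ways"; the key design choice is that the second-stage model always sees both the raw features and the first-stage predictions.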