Transfer learning emotion manifestation across music and speech

In this article, we focus on time-continuous predictions of emotion in music and speech, and on the transfer of learning from one domain to the other. First, we compare the use of Recurrent Neural Networks (RNN) with standard hidden units (Simple Recurrent Network, SRN) and Long Short-Term Memory (LSTM) blocks for intra-domain acoustic emotion recognition. We show that LSTM networks outperform SRNs, explaining, on average, 74%/59% (music) and 42%/29% (speech) of the variance in Arousal/Valence. Next, we evaluate whether cross-domain predictions of emotion are a viable option for acoustic emotion recognition, and we test the use of Transfer Learning (TL) for feature space adaptation. On average, our models explain 70%/43% (music) and 28%/11% (speech) of the variance in Arousal/Valence. Overall, the results indicate good cross-domain generalization performance, particularly for the model trained on speech and tested on music without pre-encoding of the input features. To the best of our knowledge, this is the first demonstration of cross-modal time-continuous predictions of emotion in the acoustic domain.
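To make the modelling pipeline concrete, the sketch below illustrates in PyTorch the two ingredients named above: a recurrent network that maps frame-wise acoustic features to continuous Arousal/Valence values, and a denoising autoencoder of the kind used for feature-space adaptation in transfer learning. This is a minimal sketch, not the authors' implementation; the feature dimensionality, layer sizes, corruption level, and training settings are all illustrative assumptions.

    import torch
    import torch.nn as nn

    class EmotionLSTM(nn.Module):
        """Frame-wise Arousal/Valence regression from acoustic feature sequences."""
        def __init__(self, n_features=40, hidden_size=64):  # sizes are assumptions
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, 2)  # outputs: (arousal, valence)

        def forward(self, x):            # x: (batch, time, n_features)
            h, _ = self.lstm(x)
            return self.head(h)          # (batch, time, 2)

    class DenoisingAutoencoder(nn.Module):
        """Sketch of feature-space adaptation: trained to reconstruct clean
        features from corrupted inputs; its encoder can then pre-encode
        features from both domains into a shared representation."""
        def __init__(self, n_features=40, n_hidden=32, noise=0.1):
            super().__init__()
            self.noise = noise
            self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Tanh())
            self.decoder = nn.Linear(n_hidden, n_features)

        def forward(self, x):
            corrupted = x + self.noise * torch.randn_like(x)
            return self.decoder(self.encoder(corrupted))

    def explained_variance(y_true, y_pred):
        """Proportion of variance in y_true accounted for by y_pred,
        the evaluation measure quoted in the abstract."""
        return 1.0 - torch.var(y_true - y_pred) / torch.var(y_true)

    # Toy run on random data, purely to show the interfaces.
    model = EmotionLSTM()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(8, 100, 40)   # 8 clips, 100 frames, 40 features (assumed)
    y = torch.randn(8, 100, 2)    # per-frame Arousal/Valence annotations
    for _ in range(5):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        pred = model(x)
        print(explained_variance(y[..., 0], pred[..., 0]))  # Arousal

In a cross-domain setting, the autoencoder would be fitted on features from the source and/or target domain and its encoder applied before the recurrent network; the "without pre-encoding" condition above corresponds to feeding the raw features to the LSTM directly.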
