On-line continuous-time music mood regression with deep recurrent neural networks

This paper proposes a novel machine learning approach to on-line continuous-time music mood regression, i.e., low-latency prediction of the time-varying arousal and valence of musical pieces. On the front end, a large set of segmental acoustic features is extracted to model short-term variations. Multi-variate regression is then performed by deep recurrent neural networks, which model longer-range context and thus capture the time-varying emotional profile of musical pieces. Evaluation is carried out on the 2013 MediaEval Challenge corpus, consisting of 1000 pieces annotated continuously in time with continuous arousal and valence values via crowd-sourcing. In the results, recurrent neural networks outperform support vector regression (SVR) and feedforward neural networks in both continuous-time and static music mood regression, achieving an R² of up to .70 on arousal and .50 on valence annotations.
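The pipeline described above, where per-segment acoustic feature vectors are fed to a recurrent network that emits an (arousal, valence) estimate at every time step, can be sketched as follows. This is a minimal illustration only: the plain Elman-style recurrence, all weight shapes, and the feature dimensionality are assumptions made here for brevity; the paper itself uses deep (LSTM-based) recurrent networks trained on a large acoustic feature set.

```python
# Minimal sketch of on-line continuous-time mood regression.
# Assumptions (not from the paper): a single plain-RNN layer,
# random untrained weights, 6-dimensional segment features.
import numpy as np

def rnn_mood_regression(features, Wxh, Whh, Why, bh, by):
    """Run a recurrent net over a sequence of segmental acoustic
    feature vectors, emitting (arousal, valence) at each step.
    Because each output depends only on past inputs, predictions
    are available with low latency, i.e. on-line."""
    h = np.zeros(Whh.shape[0])           # recurrent hidden state
    outputs = []
    for x in features:
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # update context
        outputs.append(Why @ h + by)          # 2-D regression output
    return np.array(outputs)

rng = np.random.default_rng(0)
n_feat, n_hidden = 6, 8                  # hypothetical sizes
Wxh = rng.normal(0, 0.1, (n_hidden, n_feat))
Whh = rng.normal(0, 0.1, (n_hidden, n_hidden))
Why = rng.normal(0, 0.1, (2, n_hidden))
bh, by = np.zeros(n_hidden), np.zeros(2)

seq = rng.normal(size=(5, n_feat))       # 5 segments of features
preds = rnn_mood_regression(seq, Wxh, Whh, Why, bh, by)
print(preds.shape)                       # one (arousal, valence) pair per segment
```

In a trained system the weights would be learned by minimizing the squared error against the crowd-sourced continuous annotations, and goodness of fit reported as R² per dimension.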
