Recognizing emotion from singing and speaking using shared models

Speech and song are two closely related forms of vocal communication. While significant progress has been made in both speech and music emotion recognition, few works have focused on building a shared emotion recognition model for speech and song. In this paper, we propose three shared emotion recognition models for speech and song: a simple model, a single-task hierarchical model, and a multi-task hierarchical model. We study the commonalities and differences in emotion expression across these two communication domains. We compare performance across different settings, investigate the relationship between evaluator agreement rate and classification accuracy, and analyze the classification performance of individual feature groups. Our results show that the multi-task model classifies emotion more accurately than the single-task models when the same feature set is used. This suggests that although spoken and sung emotion recognition are distinct tasks, they are related and can be modeled jointly. The results further show that utterances with lower agreement rates and emotions with low activation benefit the most from multi-task learning. Finally, visual features appear to be more similar across spoken and sung emotion expression than acoustic features.
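The core multi-task idea above (one model serving both the speech and song tasks, so that structure common to the two domains is learned jointly) can be illustrated with a minimal regularized multi-task sketch. This is not the paper's actual model or features: the synthetic data, the linear parameterization `W_t = W_shared + V_t`, and all hyperparameters below are illustrative assumptions, showing only how a shared component plus task-specific deviations can be trained across two related classification tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup: two related tasks ("speech" and "song")
# whose labels come from a common linear map plus a small task-specific twist.
n_feat, n_class, n_per_task = 20, 4, 200
W_true = rng.normal(size=(n_feat, n_class))
tasks = {}
for t in ("speech", "song"):
    X = rng.normal(size=(n_per_task, n_feat))
    offset = 0.3 * rng.normal(size=(n_feat, n_class))  # task-specific deviation
    y = np.argmax(X @ (W_true + offset), axis=1)
    tasks[t] = (X, y)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(task_data, share=True, steps=300, lr=0.1, lam=0.1):
    """Multi-task softmax regression with W_t = W_shared + V_t.

    With share=True, gradients from both tasks update the shared weights
    while the ridge penalty on V_t keeps task-specific deviations small.
    With share=False, each task trains an independent model (V_t only).
    """
    W_shared = np.zeros((n_feat, n_class))
    V = {t: np.zeros((n_feat, n_class)) for t in task_data}
    for _ in range(steps):
        grad_shared = np.zeros_like(W_shared)
        for t, (X, y) in task_data.items():
            W_t = (W_shared if share else 0) + V[t]
            P = softmax(X @ W_t)
            P[np.arange(len(y)), y] -= 1.0          # dL/dlogits for cross-entropy
            g = X.T @ P / len(y)
            V[t] -= lr * (g + lam * V[t])           # penalize large deviations
            grad_shared += g
        if share:
            W_shared -= lr * grad_shared / len(task_data)
    return W_shared, V

def accuracy(task_data, W_shared, V, share=True):
    accs = []
    for t, (X, y) in task_data.items():
        W_t = (W_shared if share else 0) + V[t]
        accs.append((np.argmax(X @ W_t, axis=1) == y).mean())
    return float(np.mean(accs))

W_mt, V_mt = train(tasks, share=True)    # multi-task: shared + per-task weights
W_st, V_st = train(tasks, share=False)   # single-task baselines
acc_mt = accuracy(tasks, W_mt, V_mt, share=True)
acc_st = accuracy(tasks, W_st, V_st, share=False)
```

The design mirrors the abstract's conclusion: when the two tasks are related, pooling their gradients into a shared component lets each task borrow statistical strength from the other, which matters most where a task's own signal is weak (here, by analogy, low-agreement utterances and low-activation emotions).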