On the Acoustics of Emotion in Audio: What Speech, Music, and Sound have in Common

Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring in the environment, in the soundtrack of a movie, or in a radio play. In the field of affective computing, there is currently some loosely connected research concerning each of these phenomena, but a holistic computational model of affect in sound is still lacking. At the same time, for tomorrow’s pervasive technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of “the sound that something makes,” in order to evaluate the system’s auditory environment and its own audio output. This article aims at a first step toward such a holistic computational model: starting from standard acoustic feature extraction schemes in the domains of speech, music, and sound analysis, we assess the worth of individual features across these three domains, considering four audio databases with observer annotations in the arousal and valence dimensions. In the results, we find that with an appropriate selection of descriptors, cross-domain arousal and valence regression is feasible, achieving significant correlations with the observer annotations of up to 0.78 for arousal (training on sound and testing on enacted speech) and 0.60 for valence (training on enacted speech and testing on music). The high degree of cross-domain consistency in encoding the two main dimensions of affect may be attributable to the co-evolution of speech and music from multimodal affect bursts, including the integration of nature sounds for expressive effects.
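To make the cross-domain evaluation setup concrete, the following is a minimal sketch of the general procedure: train a regressor for arousal (or valence) on acoustic descriptors from one audio domain, predict on a different domain, and measure agreement with observer annotations via correlation. It is an illustration only, not the article's implementation; the support vector regressor, the synthetic feature matrices, and the domain names used here are assumptions for the example.

```python
"""Sketch of cross-domain affect regression (illustrative assumptions only)."""

import numpy as np
from scipy.stats import pearsonr
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def cross_domain_regression(train_X, train_y, test_X, test_y):
    """Train on one audio domain, predict affect ratings in another domain,
    and return the Pearson correlation with the observer annotations."""
    model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0))
    model.fit(train_X, train_y)
    predictions = model.predict(test_X)
    r, p = pearsonr(test_y, predictions)
    return r, p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_features = 20  # placeholder for a set of acoustic descriptors

    # Synthetic stand-ins: rows are audio clips, columns are descriptors
    # (e.g., energy, F0, spectral statistics); targets are continuous
    # arousal ratings in [-1, 1].
    sound_X = rng.normal(size=(200, n_features))
    sound_arousal = np.tanh(sound_X[:, 0] + 0.5 * sound_X[:, 1]
                            + 0.1 * rng.normal(size=200))
    speech_X = rng.normal(size=(150, n_features))
    speech_arousal = np.tanh(speech_X[:, 0] + 0.5 * speech_X[:, 1]
                             + 0.1 * rng.normal(size=150))

    # Train on the "sound" domain, evaluate on the "speech" domain.
    r, p = cross_domain_regression(sound_X, sound_arousal,
                                   speech_X, speech_arousal)
    print(f"cross-domain correlation: r = {r:.2f} (p = {p:.3f})")
```

In the same spirit, the roles of the domains can be swapped (e.g., training on speech and testing on music) to probe how consistently arousal and valence are encoded across speech, music, and general sound.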
