Relevance Units Machine based dimensional and continuous speech emotion prediction

Emotion plays a significant role in human-computer interaction. Continuing improvements in speech technology have enabled many new and compelling applications in human-computer interaction, context-aware computing, and computer-mediated communication. Such applications require reliable online recognition of the user's affect. However, most emotion recognition systems operate on isolated short sentences or words. We present a framework for online emotion recognition from speech. On the front end, a voice activity detection algorithm segments the input speech, and features are estimated to model its long-term properties. Dimensional and continuous emotion recognition is then performed with a Relevance Units Machine (RUM). The advantages of the proposed system are: (i) computational efficiency at run time (regression outputs can be produced continuously in pseudo real-time), (ii) sparsity superior to that of the well-known Support Vector Regression (SVR) and the Relevance Vector Machine for regression (RVR), and (iii) predictive performance comparable to SVR and RVR.
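The pipeline described above (segment with VAD, extract long-term features per segment, predict a continuous emotion dimension with a sparse kernel model) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the energy-threshold VAD, the two toy features (mean and standard deviation of frame log-energy), and the `SparseKernelRegressor` stand-in for the RUM (a weighted sum of Gaussian kernels centered on a few "relevance units") are all hypothetical simplifications.

```python
import math

def frame_energy(samples, frame_len=160):
    """Split a signal into non-overlapping frames and return log-energy per frame."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [math.log(sum(s * s for s in samples[i:i + frame_len]) + 1e-10)
            for i in starts]

def vad_segments(energies, threshold=-5.0):
    """Naive energy-threshold VAD: (start, end) frame indices of voiced runs."""
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

def long_term_features(samples, seg, frame_len=160):
    """Toy long-term features over one segment: mean and std of frame log-energy."""
    es = frame_energy(samples[seg[0] * frame_len:seg[1] * frame_len], frame_len)
    mean = sum(es) / len(es)
    var = sum((e - mean) ** 2 for e in es) / len(es)
    return [mean, math.sqrt(var)]

class SparseKernelRegressor:
    """Stand-in for a trained RUM: prediction is a weighted sum of Gaussian
    kernels evaluated against a small set of learned relevance units."""
    def __init__(self, units, weights, bias=0.0, gamma=0.1):
        self.units, self.weights, self.bias, self.gamma = units, weights, bias, gamma

    def predict(self, x):
        def k(u):
            return math.exp(-self.gamma * sum((a - b) ** 2 for a, b in zip(u, x)))
        return self.bias + sum(w * k(u) for u, w in zip(self.units, self.weights))

# Toy 16 kHz signal: 0.1 s silence, 0.1 s of a 440 Hz tone, 0.1 s silence.
samples = ([0.0] * 1600
           + [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
           + [0.0] * 1600)
segments = vad_segments(frame_energy(samples))          # one voiced segment
features = long_term_features(samples, segments[0])     # [mean, std] of log-energy
model = SparseKernelRegressor(units=[[3.0, 0.0]], weights=[0.8])
arousal = model.predict(features)                       # continuous regression output
```

Because prediction costs one kernel evaluation per relevance unit, and a RUM keeps far fewer units than SVR keeps support vectors, outputs can be produced continuously as each new segment closes, which is the run-time property the abstract emphasizes.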
