Emotion recognition using synthetic speech as neutral reference

A common approach to recognizing emotion from speech is to estimate multiple acoustic features at the sentence or turn level. These features are derived independently of the underlying lexical content. Studies have demonstrated that lexically dependent models improve emotion recognition accuracy. However, current practical approaches can only model small lexical units such as phonemes, syllables, or a few keywords, which limits these systems. We believe that building longer lexical models (i.e., sentence-level models) is feasible by leveraging advances in speech synthesis. Assuming that the transcript of the target speech is available, we synthesize speech conveying the same lexical information. The synthetic speech is used as a neutral reference against which different acoustic features are contrasted, unveiling local emotional changes. This paper introduces this novel framework and provides insights on how to compare the target and synthetic speech signals. Our evaluations demonstrate the benefits of synthetic speech as a neutral reference for incorporating lexical dependencies in emotion recognition. The experimental results show that adding features derived from contrasting expressive speech with the proposed synthetic speech reference increases accuracy by 2.1% and 2.8% (absolute) in classifying low versus high levels of arousal and valence, respectively.
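
To make the idea of contrasting the target signal with its synthesized neutral reference concrete, the sketch below aligns the two utterances and summarizes frame-wise prosodic differences with sentence-level statistics. This is a minimal illustration under stated assumptions, not the authors' exact pipeline: the use of librosa, MFCC-based dynamic time warping for alignment, and F0/energy difference statistics as contrast features are all choices made here for illustration.

```python
# Minimal sketch: contrast a target utterance with a synthesized neutral
# reference that conveys the same lexical content, and derive sentence-level
# "difference" features for emotion classification.
# Assumptions (illustrative, not the paper's pipeline): librosa for feature
# extraction, DTW over MFCCs for alignment, F0/energy difference statistics.
import numpy as np
import librosa


def contour_features(path, sr=16000):
    """Extract frame-level F0, log-energy, and MFCCs from a WAV file."""
    y, sr = librosa.load(path, sr=sr)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # F0 contour (Hz)
    energy = librosa.feature.rms(y=y)[0]                  # RMS energy contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # used for alignment
    # Truncate to a common number of frames to keep indices consistent.
    n = min(len(f0), len(energy), mfcc.shape[1])
    return f0[:n], np.log(energy[:n] + 1e-8), mfcc[:, :n]


def contrast_features(target_wav, synthetic_wav):
    """Align target and synthetic speech with DTW, then summarize
    frame-wise F0 / energy differences with sentence-level statistics."""
    f0_t, en_t, mfcc_t = contour_features(target_wav)
    f0_s, en_s, mfcc_s = contour_features(synthetic_wav)

    # Align the two utterances on their MFCC sequences: they share the
    # transcript, so lexical content matches while timing and prosody differ.
    _, wp = librosa.sequence.dtw(X=mfcc_t, Y=mfcc_s)
    wp = wp[::-1]                                          # warping path, start to end

    # Frame-wise differences along the warping path (local emotional deviations).
    d_f0 = np.array([f0_t[i] - f0_s[j] for i, j in wp])
    d_en = np.array([en_t[i] - en_s[j] for i, j in wp])

    # Sentence-level statistics of the local differences: the contrast features.
    stats = lambda x: [np.mean(x), np.std(x), np.max(x) - np.min(x)]
    return np.array(stats(d_f0) + stats(d_en))


# Usage (hypothetical file names):
#   feats = contrast_features("target.wav", "neutral_synth.wav")
# These contrast features would be appended to standard sentence-level
# acoustic features before training an arousal/valence classifier.
```

The key design point is that alignment is driven by spectral features tied to the shared transcript, so the residual F0 and energy differences reflect expressive deviations from the neutral rendering rather than lexical differences.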
