Energy and F0 contour modeling with functional data analysis for emotional speech detection

This paper proposes the use of reference models to detect emotional prominence in energy and F0 contours. The proposed framework aims to model the intrinsic variability of these prosodic features. We present a novel approach based on Functional Data Analysis (FDA) that builds lexicon-independent reference models from a family of energy and F0 contours. The neutral models are represented by bases of functions, and the test energy and F0 contours are characterized by their projections onto the corresponding bases. The proposed system achieves accuracies as high as 80.4% in binary emotion classification on the EMODB corpus, which is 17.6% higher than that achieved by a benchmark classifier trained with sentence-level prosodic features. The approach is also evaluated on the SEMAINE corpus, showing that it can be effectively used in real applications.

Index Terms: Emotion detection, prosody modeling, emotional speech analysis, expressive speech, functional data analysis.
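The core FDA idea in the abstract, representing a contour with a functional basis and summarizing a test contour by its projection coefficients, can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the use of a cubic B-spline basis, the basis size, and the least-squares projection are all assumptions, and `bspline_design_matrix` and `project_contour` are hypothetical helper names.

```python
import numpy as np
from scipy.interpolate import BSpline


def bspline_design_matrix(t_grid, n_basis, degree=3):
    """Evaluate a clamped B-spline basis on t_grid (values in [0, 1]).

    Returns a (len(t_grid), n_basis) matrix whose columns are the
    basis functions.
    """
    n_interior = n_basis - degree - 1
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    eye = np.eye(n_basis)
    return np.column_stack(
        [BSpline(knots, eye[i], degree)(t_grid) for i in range(n_basis)]
    )


def project_contour(contour, n_basis=8):
    """Characterize a (time-normalized) F0 or energy contour by its
    least-squares projection coefficients onto the basis."""
    t = np.linspace(0.0, 1.0, len(contour))
    phi = bspline_design_matrix(t, n_basis)
    coef, *_ = np.linalg.lstsq(phi, contour, rcond=None)
    return coef, phi @ coef  # coefficients and the reconstructed contour
```

A test contour could then be compared against the neutral reference in this low-dimensional coefficient space (for example, by the distance between coefficient vectors), rather than frame by frame.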
