Abstract

The automated recognition of emotions from speech is a challenging issue. In order to build an emotion recognizer, well-defined features and optimized parameter sets are essential. This paper shows how an optimal parameter set for HMM-based recognizers can be found by applying an evolutionary algorithm to standard features from automated speech recognition. For this, we compared different signal features as well as several architectures of HMMs. The system was evaluated on a non-acted database and its performance was compared to a baseline system. We present an optimal feature set for the public part of the SmartKom database.

Index Terms: Emotion Recognition, Evolutionary Algorithms, Feature Optimization, Hidden Markov Models

1. Introduction

The interaction between humans and machines using language is nowadays becoming more and more self-evident, but machines still lack many human abilities that would considerably simplify communication and would also help to increase the acceptance of such systems. For some time, research activities have also focused more strongly on the emotional aspect of speech. By exploiting information about the emotional state of a user, machines can be enabled to adapt their dialog strategy online, depending on the user's emotions, and hence react in a more appropriate and empathic manner. As emotion recognition in many applications goes hand in hand with automated speech recognition (ASR), it would be favorable to make use of the same features or even a subset thereof. Especially small devices like smart phones or PDAs, which do not provide huge computational power, would benefit from such a sparse feature approach. Parallel research by other groups is based on pooling together (high-level) features, including the application of brute-force methods in order to fully exploit the feature space (compare [1]). This paper, however, describes an evolutionary strategy (ES) for finding an optimal sparse feature set, given no more than the common acoustic features used in ASR. An ES has the advantage of self-adapting its parameters, is able to find optimal parameter constellations in high-dimensional search spaces, and further gives insights into the relevance of each parameter with respect to the model's accuracy. Especially in the case of many parameters with unknown relationships, an ES quickly avoids wasting time on generating and testing unsuitable parameter combinations, as the evolutionary pressure effectively minimizes the probability that such combinations evolve. In ASR, Mel-Frequency Cepstral Coefficients (MFCCs) have established themselves as a basic feature for training phoneme-based recognizers. As we do not want additional parameters to be extracted from the speech signal, we concentrate only on MFCCs, which have also proven to perform well in emotion recognition during the Emotion Challenge within Interspeech 2009 (compare [2]).

This paper is structured as follows: In Section 2 we describe the spontaneous database and the emotions we want to recognize. Section 3 describes which parameters we investigated and in which range they were allowed to change during the evolution process. Section 4 introduces the evolutionary algorithm and shows how fitness is measured and how the population evolves over time. The results are presented in Section 5 and compared to our baseline recognizer. Finally, Section 6 summarizes our findings and gives an outlook.
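To make the general mechanism referred to above concrete, the following minimal sketch illustrates a (mu + lambda) evolution strategy with self-adaptive mutation step sizes. It is an illustration only: the fitness function is a placeholder standing in for training and evaluating an HMM-based recognizer, and the population sizes, parameter dimensionality, and learning rate are hypothetical values, not the settings used in this work.

    # Illustrative (mu + lambda) evolution strategy with self-adaptive
    # mutation step sizes. The fitness function is a hypothetical stand-in
    # for the accuracy of an HMM-based emotion recognizer.
    import math
    import random

    MU, LAMBDA, GENERATIONS = 5, 20, 50   # hypothetical population sizes
    DIM = 3                               # number of tunable parameters
    TAU = 1.0 / math.sqrt(2.0 * DIM)      # learning rate for step sizes

    def fitness(params):
        # Placeholder: in practice, train/evaluate a recognizer with the
        # given parameter vector and return its accuracy.
        return -sum((p - 0.5) ** 2 for p in params)

    def mutate(params, sigmas):
        # Self-adaptation: mutate the step sizes first, then use the new
        # step sizes to perturb the parameters themselves.
        new_sigmas = [s * math.exp(TAU * random.gauss(0, 1)) for s in sigmas]
        new_params = [p + s * random.gauss(0, 1)
                      for p, s in zip(params, new_sigmas)]
        return new_params, new_sigmas

    # Each individual carries its parameters and its own mutation step sizes.
    population = [([random.random() for _ in range(DIM)], [0.1] * DIM)
                  for _ in range(MU)]

    for _ in range(GENERATIONS):
        offspring = [mutate(*random.choice(population)) for _ in range(LAMBDA)]
        # (mu + lambda) selection: parents compete with their offspring.
        population = sorted(population + offspring,
                            key=lambda ind: fitness(ind[0]),
                            reverse=True)[:MU]

    best_params, _ = population[0]
    print(best_params)

Because each individual carries its own step sizes, the search adjusts its mutation strength over the generations without manual tuning, which is the self-adaptation property exploited here.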
References

[1] Björn W. Schuller, et al., "On the Influence of Phonetic Content Variation for Acoustic Emotion Recognition," PIT, 2008.
[2] Elisabeth André, et al., "Improving Automatic Emotion Recognition from Speech via Gender Differentiation," LREC, 2006.
[3] Wolfgang Wahlster, et al., "SmartKom: Foundations of Multimodal Dialogue Systems," SmartKom, 2006.
[4] Elmar Nöth, et al., "We are not amused - but how do you know? User states in a multi-modal dialogue system," INTERSPEECH, 2003.
[5] Loïc Kessous, et al., "The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals," INTERSPEECH, 2007.