TUNING HIDDEN MARKOV MODEL FOR SPEECH EMOTION RECOGNITION

In this article we introduce a speech emotion recognition method based on Hidden Markov Models (HMMs). The method uses low-level features that are popular in Automatic Speech Recognition systems. Two strategies are considered and compared. In the first, a one-from-all recognition model is constructed for each emotional state. In the second, a one-against-other strategy, each emotional state has its own model and a background model (a model for the other emotional states). We found values for the number of HMM states and the number of Gaussian mixture components that increase the robustness of the speech emotion recognition system. For proof-of-concept experiments we use the Berlin Database of Emotional Speech (EMO-DB). The recognition rate for seven discrete emotions exceeded 83%. For comparison, human listeners who rated the naturalness of the emotions in the same corpus achieved a recognition rate of 78.83%.

Introduction

Applications of emotional speech recognition can be foreseen in the broad area of human-computer interaction and in the field of security systems. In our research, emotion recognition is applied within an office environment. In man-machine interaction, non-invasive approaches seem more popular in recent work, because the user retains control over the emotion shown and because their non-invasive nature provides a certain comfort. Speech analysis seems to be one of the most promising of these approaches, so we focus on speech as the input channel in this work. Most approaches to speech emotion recognition rely on acoustic characteristics of an emotional spoken utterance. We apply automatic speech recognition (ASR) methods to the task of emotion recognition in speech. One HMM was trained for each predefined emotional state. For testing we use the public Berlin Database of Emotional Speech (EMO-DB).

The paper is structured as follows: Section 2 deals with feature extraction and HMM model specification. Section 3 introduces the emotional database which is used for the experiments. Finally, Sections 4-8 present the experiments, conclusions, future work, and acknowledgments.

Emotion Modeling

Feature extraction

Input speech signals are processed using a 25 ms Hamming window with a frame shift of 10 ms. We use a 39-dimensional feature vector per frame, consisting of 12 MFCCs and the log energy of the frame (13 static coefficients) plus their delta and acceleration coefficients. Cepstral Mean Subtraction (CMS) and variance normalization are applied to better cope with channel characteristics.

HMM modeling

To estimate a user's emotion from the speech signal we use HMMs. Two-state HMMs and single-state HMMs, the latter also called Gaussian Mixture Models (GMMs), were used. The objective of the emotion recognition task is to find, given the set of reference models Λ = {λ1, ..., λN} and a sequence of test vectors X = {x1, ..., xT}, the emotion model λi that gives the maximum a posteriori probability P(λi | X). This requires calculating P(λj | X) for all j = 1, ..., N and finding the maximum among them. In our task, the likelihood P(X | λ) can be used instead of P(λ | X), so the prior probabilities P(λ) need not be known. The vectors in a sequence X are assumed to be independent and identically distributed random variables. We can define P(X | λ) as
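Under the independence assumption stated above, a standard form of this likelihood is the product over frames, given here as an assumption, with T denoting the number of frames in X:

P(X | λ) = ∏_{t=1..T} P(x_t | λ), or in the log domain, log P(X | λ) = Σ_{t=1..T} log P(x_t | λ).

The sketch below illustrates the resulting maximum-likelihood decision for the single-state (GMM) case. It is a minimal illustration, not the implementation used in the experiments: the function names, the diagonal-covariance form, and the toy 2-dimensional model values are all hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, gmm):
    """Log density of frame x under a GMM given as (weight, mean, var) triples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def sequence_log_likelihood(frames, gmm):
    """log P(X | lambda): sum of per-frame log densities (i.i.d. assumption)."""
    return sum(log_gmm(x, gmm) for x in frames)


def classify(frames, models):
    """Return the label of the model that maximizes log P(X | lambda)."""
    return max(
        models,
        key=lambda label: sequence_log_likelihood(frames, models[label]),
    )


# Toy 2-dimensional models with hypothetical parameters;
# models for the features described above would be 39-dimensional.
models = {
    "anger": [(0.5, [2.0, 2.0], [1.0, 1.0]), (0.5, [3.0, 1.0], [1.0, 1.0])],
    "neutral": [(1.0, [0.0, 0.0], [1.0, 1.0])],
}
frames = [[2.1, 1.9], [2.8, 1.2], [2.5, 1.5]]
predicted = classify(frames, models)
```

Working in the log domain avoids numerical underflow when many per-frame probabilities are multiplied. For the one-against-other strategy, one plausible variant would compare each emotion model's score with that of its background model instead of taking a single maximum over all models.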
