Emotional speech classification using hidden conditional random fields

Although a great number of papers have been published in the area of emotional speech recognition, most of them contribute to the feature extraction phase. For classification, the hidden Markov model (HMM) is still the most commonly used method, even though it has been shown to be less accurate than its discriminative counterpart, the hidden conditional random field (HCRF) model, for example in phone classification and gesture recognition. In this study, we therefore investigate the use of the HCRF model for emotional speech classification. In our experiments, we extracted Mel-frequency cepstral coefficient (MFCC) features from the well-known Berlin emotional speech dataset (EMO) and the eNTERFACE 2005 dataset. We then used 10-fold cross-validation to train and evaluate the model and compared its results with those of the HMM. The experiments show that the HCRF achieves a significant improvement (p-value ≤ 0.05) in classification accuracy. In addition, we speed up the training phase of the model by caching the gradient computation, so our computation time is much lower than that of existing methods.
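As a reference point, the following is a minimal sketch of the standard HCRF formulation rather than this paper's exact parameterization: a class label y (here, an emotion) is assigned to an observation sequence x (here, a sequence of MFCC frames) by marginalizing over a hidden state sequence h, with parameter vector \theta and feature-function vector f(y, h, x):

    p(y \mid x; \theta)
      = \frac{\sum_{h} \exp\left(\theta^{\top} f(y, h, x)\right)}
             {\sum_{y'} \sum_{h} \exp\left(\theta^{\top} f(y', h, x)\right)}

Training maximizes the conditional log-likelihood of the correct labels; its gradient is a difference of expected feature counts, each obtained from forward-backward recursions over the hidden states, and this recursion is typically the dominant cost of each gradient evaluation.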
