Enhanced Autocorrelation in Real World Emotion Recognition

Multimodal emotion recognition in real-world environments remains a challenging task in affective computing research. Recognizing the affective or physiological state of an individual is difficult for humans as well as for computer systems, and thus finding suitable discriminative features is the most promising approach in multimodal emotion recognition. Numerous features have been developed in the literature or adapted from related signal processing tasks. Still, classifying emotional states in real-world scenarios is difficult, and the performance of automatic classifiers remains rather limited, mainly because emotional states cannot be distinguished by a well-defined set of discriminating features. In this work we present an enhanced autocorrelation feature, originally developed for multi-pitch detection, and compare its performance to well-known, state-of-the-art features from signal and speech processing. The evaluation shows that the enhanced autocorrelation feature outperforms the other state-of-the-art features on the challenge data set. The complexity of this benchmark data set lies between that of real-world data sets containing naturalistic emotional utterances and that of the widely used and well-understood acted emotional data sets.
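
To illustrate the kind of feature referred to here, the following is a minimal sketch of an enhanced-autocorrelation computation in the style commonly used for multi-pitch analysis (autocorrelation, half-wave rectification, subtraction of time-stretched copies). It is not the authors' exact pipeline; the function name, frame handling, and the choice of stretch factors are assumptions for illustration.

```python
import numpy as np

def enhanced_autocorrelation(frame, max_stretch=4):
    """Sketch of an ESACF-style enhanced autocorrelation for one speech frame.

    Hypothetical illustration: window the frame, compute the autocorrelation
    via the power spectrum, half-wave rectify, then subtract time-stretched
    copies to suppress peaks at integer multiples of the true pitch lag.
    """
    # Window and zero-pad to avoid circular-correlation artifacts.
    windowed = frame * np.hanning(len(frame))
    n_fft = 2 * len(windowed)
    spectrum = np.fft.rfft(windowed, n_fft)

    # Autocorrelation from the power spectrum (Wiener-Khinchin relation).
    acf = np.fft.irfft(np.abs(spectrum) ** 2, n_fft)[: len(windowed)]
    acf = np.clip(acf, 0.0, None)  # half-wave rectification

    # Enhancement step: subtract time-stretched copies and clip again,
    # which attenuates spurious sub-harmonic peaks.
    enhanced = acf.copy()
    lags = np.arange(len(acf))
    for factor in range(2, max_stretch + 1):
        stretched = np.interp(lags / factor, lags, acf)
        enhanced = np.clip(enhanced - stretched, 0.0, None)

    return enhanced
```

In a feature-extraction pipeline, such a function would typically be applied per frame (e.g., 25 ms windows with 10 ms hop), and the resulting enhanced-autocorrelation vectors, or statistics derived from them, would be fed to the classifier alongside standard spectral features; these framing parameters are likewise illustrative assumptions.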
