Maximum Echo-State-Likelihood Networks for Emotion Recognition

Emotion recognition is an important task in human-computer interaction. Several pattern recognition and machine learning techniques have been applied to date to assign input audio and/or video sequences to specific emotional classes. This paper introduces a novel approach to the problem that is also suitable for more generic sequence recognition tasks. The approach combines the recurrent reservoir of an echo state network with a connectionist density estimation module. The reservoir encodes input sequences into a fixed-dimensionality pattern of neuron activations. The density estimator, a constrained radial basis function network, evaluates the likelihood of the echo state given the input. Unsupervised training is accomplished within a maximum-likelihood framework. The architecture can then be used to estimate class-conditional probabilities, enabling emotion classification within a Bayesian setup. Preliminary experiments on emotion recognition from speech signals in the WaSeP© dataset show that the proposed approach is effective and may outperform state-of-the-art classifiers.
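The pipeline described in the abstract, a fixed random reservoir that encodes a variable-length sequence into a fixed-dimensionality state vector, followed by a Gaussian radial basis function mixture that scores the likelihood of that state, can be sketched as below. All function names, parameter values, and the spherical-Gaussian form of the density are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_reservoir(n_in, n_res, spectral_radius=0.9, seed=0):
    """Random input and recurrent weights; the recurrent matrix is
    rescaled so the echo state property is likely to hold."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def encode(seq, W_in, W):
    """Drive the reservoir with the input sequence and return the
    final activation pattern as a fixed-dimensionality encoding."""
    x = np.zeros(W.shape[0])
    for u in seq:  # seq has shape (time_steps, n_in)
        x = np.tanh(W_in @ u + W @ x)
    return x

def rbf_log_likelihood(x, centers, widths, weights):
    """Log-likelihood of state x under a mixture of spherical
    Gaussians (one RBF kernel per component), computed stably
    via the log-sum-exp trick."""
    d = x.shape[0]
    sq_dist = ((centers - x) ** 2).sum(axis=1)
    log_comp = (np.log(weights)
                - 0.5 * d * np.log(2.0 * np.pi * widths ** 2)
                - sq_dist / (2.0 * widths ** 2))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())
```

In a Bayesian classification setup, one such density estimator would be trained per emotion class by maximum likelihood, and a test sequence assigned to the class whose estimator gives the largest class-conditional likelihood (weighted by the class prior).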
