Segmented-Memory Recurrent Neural Networks versus Hidden Markov Models in Emotion Recognition from Speech

Emotion recognition from speech is the task of determining a speaker's emotional state from his or her voice. The most widely used classifiers in this field are Hidden Markov Models (HMMs) and Support Vector Machines. Neither architecture is designed to capture the full dynamic character of speech. HMMs are able to model the temporal characteristics of speech at the phoneme, word, or utterance level, but they fail to learn the dynamics of the input signal on short time scales (e.g., at frame rate). The use of dynamical features (first and second derivatives of the speech features) attenuates this problem. We propose the use of Segmented-Memory Recurrent Neural Networks to learn the full spectrum of speech dynamics, so that the dynamical features can be removed from the input data. The resulting neural network classifier is compared to HMMs trained on the reduced feature set as well as to HMMs that work with the full set of features. The networks perform comparably to the HMMs while using significantly fewer features.
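The dynamical features mentioned above are commonly obtained with a linear-regression estimate of the derivative over a window of neighboring frames (the HTK-style delta formula); second derivatives are then deltas of deltas. A minimal sketch, with illustrative function names and `N=2` as an assumed window half-width:

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta (first-derivative) coefficients.

    features: (T, D) array of frame-wise speech features (e.g., MFCCs).
    Returns an array of the same shape; edge frames are handled by
    repeating the first/last frame.
    """
    T = features.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.zeros_like(features, dtype=float)
    for t in range(T):
        # padded[t + N] corresponds to original frame t
        out[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return out

def delta_delta(features, N=2):
    """Acceleration (second-derivative) coefficients: delta of delta."""
    return delta(delta(features, N), N)
```

Appending `delta` and `delta_delta` to the static features triples the input dimensionality, which is exactly the overhead the proposed recurrent networks avoid by learning the frame-level dynamics internally.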
