A Two-Stage Hierarchical Bilingual Emotion Recognition System Using a Hidden Markov Model and Neural Networks

Speech emotion recognition continues to attract considerable research interest, especially in mixed-language scenarios. Here, we show that emotion expression is language-dependent and that enhanced emotion recognition systems can be built when the language is known. We propose a two-stage emotion recognition system that first identifies the language and then applies a dedicated language-dependent recognition system to identify the type of emotion. The system accurately recognizes the four main emotion types, namely neutral, happy, angry, and sad, which are widely used in practical setups. To keep the computational complexity low, we identify the language using a feature vector consisting of sub-band energies from a basic wavelet decomposition. A hidden Markov model (HMM) then tracks the evolution of this vector to identify the language, achieving recognition accuracy close to 100%. Once the language is identified, a set of speech processing features, including pitch and MFCCs, is used with a neural network (NN) architecture to identify the emotion type. The results show that identifying the language first substantially improves the overall accuracy of emotion recognition; the proposed system achieves an overall accuracy of more than 93%. To test the robustness of the proposed methodology, we also used a Gaussian mixture model (GMM) for both language identification and emotion recognition. Our HMM-NN approach outperformed the GMM-based approach. More importantly, when we tested the proposed algorithm with six emotions, its overall accuracy remained excellent, while the performance of the GMM-based approach deteriorated substantially. It is worth noting that the performance we achieved is close to that attained by single-language emotion recognition systems and far exceeds that of recognition systems without language identification (around 60%). This work demonstrates the strong correlation between language and emotion type, and the approach can be extended to other scenarios, including gender-based, facial-expression-based, and age-based emotion recognition.
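
The first stage can be sketched as follows. The abstract specifies only that the language is identified from the energies of a basic wavelet decomposition, tracked over time by an HMM; the wavelet family (db4), decomposition depth, frame sizes, number of HMM states, and helper names below are illustrative assumptions, and the sketch relies on PyWavelets and hmmlearn rather than any implementation named by the paper.

```python
import numpy as np
import pywt                    # PyWavelets, assumed available
from hmmlearn import hmm       # hmmlearn, assumed available

def wavelet_energy_vector(frame, wavelet="db4", level=3):
    """Sub-band energies from a basic wavelet decomposition of one frame.

    The wavelet family and depth are assumptions; the paper only says
    'energies from a basic wavelet decomposition'.
    """
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (sizes are assumptions,
    corresponding to 25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])

def train_language_hmm(utterances, n_states=5):
    """Fit one HMM per language on sequences of wavelet-energy vectors."""
    feats = [np.stack([wavelet_energy_vector(f) for f in frame_signal(u)])
             for u in utterances]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    model.fit(np.concatenate(feats), lengths=[len(f) for f in feats])
    return model

def identify_language(signal, models):
    """Pick the language whose HMM assigns the highest log-likelihood."""
    feats = np.stack([wavelet_energy_vector(f) for f in frame_signal(signal)])
    return max(models, key=lambda lang: models[lang].score(feats))
```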
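The second, language-dependent stage might look like the sketch below. The paper lists pitch and MFCCs among the features and a neural network as the classifier, but does not specify how trajectories are pooled or how the network is sized; the mean/std pooling, layer sizes, and librosa/scikit-learn calls are assumptions, and identify_language refers to the previous sketch.

```python
import numpy as np
import librosa                                   # assumed available
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["neutral", "happy", "angry", "sad"]

def emotion_features(signal, sr=16000, n_mfcc=13):
    """Utterance-level pitch + MFCC statistics.

    Pooling each trajectory into mean/std summaries is an assumption
    made here to obtain a fixed-length input vector for the network.
    """
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    f0 = librosa.yin(signal, fmin=50, fmax=400, sr=sr)   # pitch track
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [f0.mean(), f0.std()]])

def train_emotion_nn(signals, labels):
    """One NN per language; the layer sizes are illustrative."""
    X = np.stack([emotion_features(s) for s in signals])
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(X, labels)
    return clf

def recognize(signal, language_hmms, emotion_nns):
    """Two-stage decision: language first, then the matching emotion NN."""
    lang = identify_language(signal, language_hmms)   # stage 1 (HMM)
    nn = emotion_nns[lang]                            # stage 2 (NN)
    return lang, nn.predict([emotion_features(signal)])[0]
```

In this arrangement, adding a language amounts to training one new HMM and one new emotion NN, which is one plausible reading of why the hierarchical design scales to more emotion classes better than a single flat classifier.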
