Multilingual Speech Emotion Recognition System Based on a Three-Layer Model

Speech Emotion Recognition (SER) systems currently focus on classifying emotions within a single language. Because optimal acoustic feature sets are strongly language-dependent, building a generalized SER system that works across multiple languages still faces the challenges of selecting common features and of retraining. In this paper, we therefore present an SER system for a multilingual scenario from the perspective of human perceptual processing. The goal is twofold. First, to predict multilingual emotion dimensions as accurately as human annotators, we study a three-layer model consisting of acoustic features, semantic primitives, and emotion dimensions, combined with a Fuzzy Inference System (FIS). Second, drawing on knowledge of how humans perceive emotion across languages in the dimensional space, we adopt direction and distance as common features to detect multilingual emotions. Results show that the estimation performance for emotion dimensions is comparable to human evaluation, and the classification rates achieved are close to those of monolingual SER systems.
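
To make the second idea concrete, the following Python sketch illustrates how direction and distance relative to a neutral point in the valence-activation plane could serve as language-common classification features. It is a minimal illustration under assumptions of our own, not the authors' implementation: the neutral origin, the reference angles, and all function names are hypothetical.

    # Sketch of direction/distance features in the valence-activation plane.
    # The neutral origin and reference angles below are illustrative
    # assumptions, not values from the paper.
    import math

    # Hypothetical reference directions (radians) for emotion categories,
    # measured from a neutral origin at (0, 0) as atan2(activation, valence).
    REFERENCE_ANGLES = {
        "joy": math.atan2(0.8, 0.7),       # high activation, positive valence
        "anger": math.atan2(0.8, -0.7),    # high activation, negative valence
        "sadness": math.atan2(-0.6, -0.5), # low activation, negative valence
    }

    def direction_and_distance(valence, activation, origin=(0.0, 0.0)):
        """Polar features of an utterance's estimated (valence, activation)
        point relative to the neutral origin."""
        dv = valence - origin[0]
        da = activation - origin[1]
        distance = math.hypot(dv, da)   # emotion intensity
        direction = math.atan2(da, dv)  # emotion category cue
        return direction, distance

    def classify(valence, activation, min_distance=0.2):
        """Assign the category whose reference direction is angularly
        closest; points near the origin are treated as neutral."""
        direction, distance = direction_and_distance(valence, activation)
        if distance < min_distance:
            return "neutral"

        def angular_diff(a, b):
            d = abs(a - b) % (2 * math.pi)
            return min(d, 2 * math.pi - d)

        return min(REFERENCE_ANGLES,
                   key=lambda c: angular_diff(direction, REFERENCE_ANGLES[c]))

    print(classify(0.6, 0.7))  # -> "joy"

Because direction and distance are defined relative to the neutral point rather than in absolute feature coordinates, such polar features are plausible candidates for transfer across languages, which is the motivation the abstract gives for adopting them.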
