Toward relaying an affective Speech-to-Speech translator: Cross-language perception of emotional state represented by emotion dimensions

Affective speech-to-speech translation (S2ST) is to preserve the affective state conveyed in the speaker's message. The ultimate goal of this study is to construct an affective S2ST system that has the ability to transform the emotional states of a spoken utterance from one language to another language. A universal automatic speech-emotion-recognition system is required to detect emotional state regardless of language. Therefore, this study investigates commonalities and differences of emotion perception across multi-languages. Thirty subjects from three countries, Japan, China and Vietnam, evaluate three emotional speech databases, Japanese, Chinese and German, in valence-activation space. The results reveal that directions from neutral to other emotions are similar among subjects groups. However, the estimated degree of emotional state depend on the expressed emotional styles. Moreover, neutral positions were significantly different among subjects groups. Thus, directions and distances from neutral to other emotions could be adopted as features to recognize emotional states for multi-languages.