A Three-Layer Emotion Perception Model for Valence and Arousal-Based Detection from Multilingual Speech

Automated emotion detection from speech has recently shifted from monolingual to multilingual tasks, aiming at human-like interaction in real-life settings where a system must handle more than a single input language. However, most work on monolingual emotion detection is difficult to generalize to multiple languages, because the optimal feature sets differ from one language to another. Our study proposes a framework to design, implement, and validate an emotion detection system using multiple corpora. A continuous dimensional space of valence and arousal is first used to describe the emotions. A three-layer model incorporating fuzzy inference systems is then used to estimate these two dimensions. Speech features derived from prosody, the spectrum, and the glottal waveform are examined and selected to capture emotional cues. The new system outperforms an existing state-of-the-art system, yielding a smaller mean absolute error and a higher correlation between its estimates and human evaluations. Moreover, results of speaker-independent validation are comparable to those of human evaluators.
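
To make the fuzzy-inference stage concrete, below is a minimal Python sketch of a single Mamdani-style fuzzy inference step that maps normalized acoustic features to an arousal estimate. The feature names (f0_mean, energy), the membership functions, the rules, and the output centroids are illustrative assumptions for this sketch, not the configuration reported in the paper; the full three-layer model chains such mappings through an intermediate perceptual layer rather than going from features to a dimension in one step.

def trimf(x, a, b, c):
    """Triangular membership function; (a, b, c) are the corner points."""
    left = 1.0 if b == a else (x - a) / (b - a)
    right = 1.0 if c == b else (c - x) / (c - b)
    return max(min(left, right), 0.0)

# Fuzzy sets over acoustic features normalized to [0, 1].
SETS = {"low": (0.0, 0.0, 0.5), "mid": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.0)}

# Illustrative Mamdani-style rules mapping two prosodic features to arousal
# in [-1, 1]; each rule pairs antecedent labels with an output centroid.
RULES = [
    ({"f0_mean": "high", "energy": "high"}, +0.8),  # loud, high-pitched -> aroused
    ({"f0_mean": "mid",  "energy": "mid"},   0.0),  # neutral delivery
    ({"f0_mean": "low",  "energy": "low"},  -0.8),  # soft, low-pitched -> calm
]

def estimate_arousal(features):
    """One fuzzy-inference stage: fire all rules, then defuzzify by a
    weighted average of the rule centroids."""
    strengths, centroids = [], []
    for antecedent, centroid in RULES:
        # Rule strength = fuzzy AND (min) over the antecedent memberships.
        w = min(trimf(features[name], *SETS[label])
                for name, label in antecedent.items())
        strengths.append(w)
        centroids.append(centroid)
    total = sum(strengths)
    if total == 0.0:
        return 0.0  # no rule fired; fall back to a neutral estimate
    return sum(w * c for w, c in zip(strengths, centroids)) / total

# Example: a high-pitched, energetic utterance yields a positive arousal estimate.
print(estimate_arousal({"f0_mean": 0.9, "energy": 0.8}))  # ~0.6

The weighted-average defuzzification keeps the output continuous, which is what a dimensional (valence/arousal) representation requires, in contrast to categorical emotion labels.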
