Enhancing Multilingual Recognition of Emotion in Speech by Language Identification

We investigate, for the first time, whether model selection based on automatic language identification (LID) can improve multilingual recognition of emotion in speech. Six emotional speech corpora from three language families (Germanic, Romance, Sino-Tibetan) are evaluated. Emotions are represented by the quadrants of the arousal/valence plane, i.e., positive/negative arousal and valence. Four approaches for choosing an optimal training set for the current language are compared: selection within the same language family, selection across language families, use of all available corpora, and selection based on automatic LID. We find that, on average, the proposed LID-based approach to selecting training corpora is superior to using all available corpora when the spoken language is not known.
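The two core ideas of the abstract, binning arousal/valence into quadrant classes and choosing training corpora from the LID output, can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function names, the zero thresholds, and the language-to-corpus mapping are all assumptions.

```python
# Hypothetical sketch of the paper's setup. Thresholds, helper names, and
# the language->corpus mapping below are illustrative assumptions only.

def quadrant(arousal: float, valence: float) -> str:
    """Bin a continuous (arousal, valence) pair into one of the four
    quadrant labels used as emotion classes (threshold at 0 is assumed)."""
    a = "positive" if arousal >= 0.0 else "negative"
    v = "positive" if valence >= 0.0 else "negative"
    return f"{a}-arousal/{v}-valence"

def select_training_corpora(predicted_language: str,
                            corpora_by_language: dict[str, list[str]]) -> list[str]:
    """Pick the training corpora matching the language predicted by the
    LID front-end; fall back to pooling all corpora for unseen languages."""
    matched = corpora_by_language.get(predicted_language)
    if matched is not None:
        return matched
    # Fallback: the "use all available corpora" baseline from the paper.
    return [c for corpora in corpora_by_language.values() for c in corpora]

# Hypothetical language->corpus assignment for demonstration.
corpora = {"german": ["corpus_A"], "mandarin": ["corpus_B", "corpus_C"]}
print(quadrant(0.7, -0.3))                            # a quadrant label
print(select_training_corpora("mandarin", corpora))   # corpora for Mandarin
```

In this sketch, an unrecognized language degrades gracefully to the pooled-corpora baseline that the paper compares against.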
