LSTM Based Cross-Corpus and Cross-Task Acoustic Emotion Recognition

Acoustic emotion recognition is a popular and central research direction in paralinguistic analysis, due to its relation to a wide range of affective states and traits and its manifold applications. Developing highly generalizable models remains a challenge for researchers and engineers because of a multitude of nuisance factors. To generalize well, deployed models need to handle spontaneous speech recorded under acoustic conditions that differ from those of the training set, which requires testing the models for cross-corpus robustness. In this work, we first investigate the suitability of Long Short-Term Memory (LSTM) models trained with time- and space-continuously annotated affective primitives for cross-corpus acoustic emotion recognition. We then employ an effective approach that uses the frame-level valence and arousal predictions of LSTM models for utterance-level affect classification, and apply this approach to the ComParE 2018 challenge corpora. The proposed method alone gives promising results on both the development and test sets of the Self-Assessed Affect Sub-Challenge. On the development set, the cross-corpus prediction based method boosts performance when fused with the top components of the baseline system. These results indicate the suitability of the proposed method for both time-continuous and utterance-level cross-corpus acoustic emotion recognition tasks.
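
The core idea described above, an LSTM regressing valence and arousal per frame whose prediction contours are pooled into a fixed-length utterance descriptor for a downstream classifier, can be illustrated with a minimal sketch. This is not the authors' pipeline: the model size, the 88-dimensional frame features, and the mean/std pooling are all illustrative assumptions.

```python
# Hypothetical sketch: a frame-level LSTM valence/arousal regressor whose
# per-frame predictions are pooled into an utterance-level feature vector.
# All dimensions and hyperparameters below are assumptions, not the paper's.
import torch
import torch.nn as nn

class FrameAffectLSTM(nn.Module):
    """Maps a sequence of acoustic frames to per-frame [valence, arousal]."""
    def __init__(self, n_feats=88, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # two affective primitives per frame

    def forward(self, x):                  # x: (batch, frames, n_feats)
        h, _ = self.lstm(x)                # h: (batch, frames, hidden)
        return self.head(h)                # (batch, frames, 2)

def utterance_features(model, x):
    """Pool frame-level predictions into a fixed-length utterance descriptor
    (mean and standard deviation of the valence/arousal contours)."""
    with torch.no_grad():
        y = model(x)                       # (batch, frames, 2)
    return torch.cat([y.mean(dim=1), y.std(dim=1)], dim=1)  # (batch, 4)

# Example: 3 utterances, 500 frames each, 88 frame-level acoustic features
model = FrameAffectLSTM()
feats = utterance_features(model, torch.randn(3, 500, 88))
print(feats.shape)                         # torch.Size([3, 4])
```

In a setup like this, the pooled descriptor would then be fed to an utterance-level classifier trained on the target task's labels; which pooling statistics and classifier work best is an empirical question.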
