OCAE: Organization-Controlled Autoencoder for Unsupervised Speech Emotion Analysis

One of the main obstacles to speech emotion analysis is the scarcity of reliably labelled speech data. An important direction is therefore to apply unsupervised methods that generate a low-dimensional representation for analysing emotions. Such a data-driven representation needs to be stable and meaningful, like the 2D or 3D representations of emotion elaborated in psychology. In this paper, we propose a fully unsupervised approach, called the Organization-Controlled AutoEncoder (OCAE), which combines an autoencoder with PCA to build an emotional representation. We use the result of PCA on speech features to control the organization of the data in the autoencoder's latent space by adding an organization loss to the classical objective function. Indeed, PCA preserves the organization of the data, whereas an autoencoder yields better discrimination of the data; by combining both, we take advantage of each method. Results on the Emo-DB and SEMAINE databases show that the representation generated in this unsupervised manner is both meaningful and stable.
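The combined objective described above can be illustrated with a minimal sketch. This is an assumed form of the OCAE loss, not the authors' implementation: a toy linear encoder/decoder stands in for the autoencoder networks, the organization loss is assumed to be an L2 distance between the latent codes and the PCA projection of the same inputs, and the trade-off weight `lam` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))         # 100 utterance-level speech feature vectors
X = X - X.mean(axis=0)                 # centre the data for PCA

# PCA projection to a 2-D space via SVD; these coordinates act as
# organization targets for the latent space
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z_pca = X @ Vt[:2].T

# Toy linear encoder/decoder (stand-ins for the autoencoder networks)
W_enc = rng.normal(scale=0.1, size=(20, 2))
W_dec = rng.normal(scale=0.1, size=(2, 20))

Z = X @ W_enc                          # latent codes
X_hat = Z @ W_dec                      # reconstruction

recon_loss = np.mean((X - X_hat) ** 2)  # classical autoencoder term
org_loss = np.mean((Z - Z_pca) ** 2)    # organization term (assumed L2 form)
lam = 0.5                               # trade-off weight (hypothetical)
total_loss = recon_loss + lam * org_loss
print(total_loss)
```

Minimising `total_loss` with respect to the encoder/decoder weights would push the latent codes toward the PCA layout (preserving the data's organization) while the reconstruction term keeps the codes informative.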
