Variational Autoencoders for Learning Latent Representations of Speech Emotion

Learning latent representations of data in an unsupervised fashion can yield more relevant features that enhance the performance of a classifier. For speech emotion recognition, generating effective features is crucial. Recently, deep generative models such as Variational Autoencoders (VAEs) have been highly successful at modeling natural images. Inspired by this success, we use VAEs in this paper to model emotions in human speech. We derive a latent representation of the speech signal and use it to classify emotions. We demonstrate that features learned by VAEs can achieve state-of-the-art emotion recognition results.
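
As a rough illustration of what deriving a latent representation with a VAE involves, the following minimal sketch maps one feature frame to the mean and log-variance of a Gaussian over the latent code, samples a latent vector with the reparameterization trick, and computes the KL term that regularizes the code toward a standard normal prior. It is a generic sketch, not the architecture used in this paper: the linear encoder, the weight matrices `w_mu` and `w_log_var`, and the 3-dimensional input frame are all hypothetical values chosen for illustration.

```python
import math
import random


def encode(frame, w_mu, w_log_var):
    """Hypothetical linear encoder: map a feature frame to (mu, log_var)."""
    mu = [sum(w * x for w, x in zip(row, frame)) for row in w_mu]
    log_var = [sum(w * x for w, x in zip(row, frame)) for row in w_log_var]
    return mu, log_var


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)                 # fixed seed so the sample is repeatable
frame = [0.2, -0.1, 0.4]               # made-up 3-dim speech feature frame
w_mu = [[0.5, 0.0, -0.5], [0.1, 0.2, 0.3]]        # assumed encoder weights
w_log_var = [[0.0, 0.0, 0.0], [0.1, -0.1, 0.0]]   # assumed encoder weights

mu, log_var = encode(frame, w_mu, w_log_var)
z = sample_latent(mu, log_var, rng)    # 2-dim latent code for this frame
kl = kl_to_standard_normal(mu, log_var)
```

In a setup like the one the abstract describes, a latent vector such as `z` (or the mean `mu`) would serve as the input feature for a downstream emotion classifier. A full model would also need a decoder and a reconstruction loss, both omitted here.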
