On Enhancing Speech Emotion Recognition using Generative Adversarial Networks

Generative Adversarial Networks (GANs) have gained considerable attention from the machine learning community due to their ability to learn and mimic an input data distribution. A GAN consists of a generator and a discriminator working in tandem, playing a min-max game in which the generator learns to map samples drawn from a simple prior (e.g., a uniform or Gaussian distribution) onto the underlying target data distribution. Once trained, the generator can synthesize new examples from that target distribution. We investigate the use of GANs to generate synthetic feature vectors for speech emotion recognition. Specifically, we study two setups: (i) a vanilla GAN that learns the distribution of a lower-dimensional representation of the actual higher-dimensional feature vector, and (ii) a conditional GAN that learns the distribution of the higher-dimensional feature vectors conditioned on the emotion class labels. As a practical application of the synthetically generated samples, we measure the improvement in a classifier's performance when synthetic data is used alongside real data for training. We perform cross-validation analyses followed by a cross-corpus study.
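The min-max game between generator and discriminator can be made concrete with a short sketch. Below is a minimal, illustrative PyTorch implementation of setup (i), a vanilla GAN over a lower-dimensional feature representation; all names and numbers (`code_dim`, `noise_dim`, layer widths, learning rates) are assumptions for illustration, not the architecture or hyperparameters used in the paper.

```python
# Minimal vanilla-GAN sketch (illustrative; dimensions and hyperparameters are assumed).
import torch
import torch.nn as nn

code_dim = 2    # assumed size of the lower-dimensional feature representation
noise_dim = 16  # assumed size of the simple prior fed to the generator

# Generator: maps noise from a simple prior into the target feature space.
G = nn.Sequential(
    nn.Linear(noise_dim, 64), nn.ReLU(),
    nn.Linear(64, code_dim),
)

# Discriminator: outputs the probability that its input is a real sample.
D = nn.Sequential(
    nn.Linear(code_dim, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_batch):
    """One round of the min-max game on a batch of real feature vectors."""
    n = real_batch.size(0)
    z = torch.randn(n, noise_dim)  # samples from the simple (Gaussian) prior
    fake_batch = G(z)

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0.
    opt_d.zero_grad()
    loss_d = (bce(D(real_batch), torch.ones(n, 1))
              + bce(D(fake_batch.detach()), torch.zeros(n, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator step: push D(G(z)) -> 1, i.e. fool the discriminator.
    opt_g.zero_grad()
    loss_g = bce(D(fake_batch), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

After training, `G(torch.randn(n, noise_dim))` yields `n` synthetic samples from the learned distribution, which is what makes the augmentation experiment in the abstract possible.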
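Setup (ii) can be sketched in the same framework by concatenating a one-hot emotion label to both the generator's noise input and the discriminator's feature input, in the spirit of Mirza and Osindero's conditional GAN. Again, `feat_dim`, `num_classes`, and the layer widths are illustrative assumptions rather than the paper's actual configuration.

```python
# Conditional-GAN sketch: condition on emotion labels via concatenation (illustrative).
import torch
import torch.nn as nn

feat_dim = 1582  # assumed size of the higher-dimensional feature vector
num_classes = 4  # assumed number of emotion classes
noise_dim = 100  # assumed prior dimension

class CondGenerator(nn.Module):
    """Generates a feature vector given noise and a one-hot emotion label."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + num_classes, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

class CondDiscriminator(nn.Module):
    """Scores (feature vector, label) pairs as real or synthetic."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_classes, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, y_onehot):
        return self.net(torch.cat([x, y_onehot], dim=1))

# Once trained, class-conditional synthesis gives labeled synthetic data
# that can be mixed with real data when training an emotion classifier:
G = CondGenerator()
labels = torch.randint(num_classes, (32,))
y = torch.eye(num_classes)[labels]            # one-hot emotion labels
synthetic = G(torch.randn(32, noise_dim), y)  # 32 labeled synthetic feature vectors
```

The key design point is that the label enters both networks, so the generator learns a per-class distribution and the discriminator judges whether a feature vector is plausible for its stated class.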
