Learning Representations of Affect from Speech

Prior work on representation learning for speech has focused largely on recognition applications; comparatively little attention has been given to learning effective representations of affect from speech, in which the paralinguistic elements of speech are separated from the verbal content. In this paper, we explore denoising autoencoders for learning paralinguistic attributes, i.e., categorical and dimensional affective traits, from speech. We show that the representations learnt by the bottleneck layer of the autoencoder are highly discriminative of activation intensity and effective at separating negative valence (sadness and anger) from positive valence (happiness). We experiment with different input speech features (such as FFT and log-mel spectrograms with temporal context windows) and different autoencoder architectures (such as stacked and deep autoencoders). We also learn utterance-specific representations by combining denoising autoencoders with BLSTM-based recurrent autoencoders. To evaluate the quality of the learnt temporal/dynamic representations, we use them for emotion classification. Experiments on a well-established real-life speech dataset (IEMOCAP) show that the learnt representations are comparable to state-of-the-art hand-crafted features (such as voice quality features and MFCCs) and are competitive with state-of-the-art approaches to emotion and dimensional affect recognition.
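As a concrete illustration of the core idea, here is a minimal sketch of a frame-level denoising autoencoder over context-stacked log-mel features, written in PyTorch. This is not the authors' implementation: the number of mel bands, context width, layer sizes, bottleneck dimension, and noise level are illustrative assumptions, not values from the paper. The bottleneck code is what would serve as the learnt affect representation.

```python
# Minimal sketch of a frame-level denoising autoencoder (assumed PyTorch).
# All dimensions and the noise level below are illustrative assumptions.
import torch
import torch.nn as nn

N_MELS, CONTEXT = 40, 5                  # assumed: 40 log-mel bands, +/-5 frame context
INPUT_DIM = N_MELS * (2 * CONTEXT + 1)   # one context-stacked input frame

class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim=INPUT_DIM, bottleneck_dim=64):
        super().__init__()
        # Encoder compresses the corrupted context window to a bottleneck
        # code; the code is what gets used as the affect representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x, noise_std=0.1):
        # Corrupt the input with Gaussian noise and reconstruct the clean
        # frames, forcing the bottleneck to retain robust structure.
        x_noisy = x + noise_std * torch.randn_like(x)
        code = self.encoder(x_noisy)
        return self.decoder(code), code

model = DenoisingAutoencoder()
frames = torch.randn(32, INPUT_DIM)           # batch of context-stacked frames
recon, code = model(frames)
loss = nn.functional.mse_loss(recon, frames)  # reconstruct the *clean* input
```

For utterance-level representations, the per-frame bottleneck codes would then be fed as a sequence into a recurrent autoencoder (the paper uses a BLSTM-based one) whose final state summarises the utterance.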
