On the Correlation and Transferability of Features Between Automatic Speech Recognition and Speech Emotion Recognition

The correlation between Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) is poorly understood. Studying such a correlation may pave the way for integrating both tasks into a single system, or may yield insights that advance either system, such as improving how ASR handles emotional speech or embedding linguistic input into SER. In this paper, we quantify the relation between ASR and SER by studying, via transfer learning, how relevant the features learned by deep convolutional neural networks for one task are to the other. Experiments are conducted on the TIMIT and IEMOCAP databases. Results reveal an intriguing correlation between the two tasks: features learned in some layers, particularly the initial layers of the network, for either task were found to be applicable to the other task to varying degrees.
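
To make the transfer-learning setup concrete, the sketch below shows, in PyTorch, the kind of layer-wise transfer experiment the abstract describes: train a CNN on one task (e.g. ASR phone classification on TIMIT), copy its first k convolutional blocks into a fresh network for the other task (e.g. SER on IEMOCAP), freeze them, and fine-tune the rest. This is a minimal illustration under assumed choices; the architecture, layer sizes, class counts, and the helper transfer_first_k_layers are hypothetical, not the authors' actual model.

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    """Small CNN over log-mel spectrogram patches; only n_classes differs per task."""
    def __init__(self, n_classes):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        ])
        # Global pooling keeps the classifier input size independent of patch size.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, n_classes))

    def forward(self, x):
        for block in self.convs:
            x = block(x)
        return self.head(x)

def transfer_first_k_layers(src, dst, k, freeze=True):
    """Copy the first k conv blocks from src into dst and optionally freeze them."""
    for i in range(k):
        dst.convs[i].load_state_dict(src.convs[i].state_dict())
        if freeze:
            for p in dst.convs[i].parameters():
                p.requires_grad = False

# asr_net is assumed to have been pre-trained on TIMIT phone labels; ser_net is
# then fine-tuned on IEMOCAP emotion labels, updating only the unfrozen layers.
asr_net = SpeechCNN(n_classes=39)  # e.g. the folded 39-phone TIMIT set
ser_net = SpeechCNN(n_classes=4)   # e.g. 4 categorical IEMOCAP emotions
transfer_first_k_layers(asr_net, ser_net, k=2)
optimizer = torch.optim.Adam((p for p in ser_net.parameters() if p.requires_grad),
                             lr=1e-4)
```

Varying k and comparing fine-tuned accuracy against a from-scratch baseline gives a per-layer measure of how transferable the features are between the two tasks, in the spirit of Yosinski et al.'s transferability methodology.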
