Learning speech emotion features by joint disentangling-discrimination

Speech plays an important part in human-computer interaction. As a major branch of speech processing, speech emotion recognition (SER) has drawn considerable attention from researchers. Discriminative features are of great importance in SER, but emotion-specific features are commonly mixed with other features. In this paper, we introduce an approach that separates these two groups of features as far as possible. First, we employ an unsupervised feature learning framework to obtain rough features. These rough features are then fed into a semi-supervised feature learning framework. In this phase, we disentangle the emotion-specific features from the other features using a novel loss function that combines a reconstruction penalty, an orthogonal penalty, a discriminative penalty, and a verification penalty. The orthogonal penalty separates the emotion-specific features from the other features. The discriminative penalty enlarges inter-emotion variations, while the verification penalty reduces intra-emotion variations. Evaluations on the FAU Aibo emotion database show that our approach can improve speech emotion classification performance.
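The abstract names the four penalties but does not give their formulas, so the following is only a minimal sketch of how such a combined loss might be assembled. Every concrete form here is an assumption made for illustration: squared error for reconstruction, a squared cross-product between the two feature blocks for orthogonality, softmax cross-entropy for discrimination, and the mean squared distance between same-emotion pairs for verification. The function name `joint_loss`, its arguments, and the weights `w_*` are hypothetical and are not taken from the paper.

```python
import numpy as np

def joint_loss(x, x_rec, h_emo, h_other, logits, labels,
               w_rec=1.0, w_orth=1.0, w_disc=1.0, w_ver=1.0):
    """Illustrative combined loss; all term definitions are assumptions.

    x, x_rec : (n, d) inputs and their reconstructions
    h_emo    : (n, k) emotion-specific features
    h_other  : (n, m) remaining features
    logits   : (n, c) classifier scores computed from h_emo
    labels   : (n,)   integer emotion labels
    """
    n = x.shape[0]

    # Reconstruction penalty (assumed): mean squared reconstruction error.
    rec = np.mean(np.sum((x - x_rec) ** 2, axis=1))

    # Orthogonal penalty (assumed): squared Frobenius norm of the
    # cross-product between the two feature blocks, scaled by n^2.
    orth = np.sum((h_emo.T @ h_other) ** 2) / n ** 2

    # Discriminative penalty (assumed): softmax cross-entropy,
    # computed with a max shift for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    disc = -np.mean(log_p[np.arange(n), labels])

    # Verification penalty (assumed): mean squared distance between
    # emotion-specific features of distinct samples sharing a label.
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    d2 = np.sum((h_emo[:, None, :] - h_emo[None, :, :]) ** 2, axis=2)
    ver = float(d2[same].mean()) if same.any() else 0.0

    parts = {"reconstruction": float(rec), "orthogonal": float(orth),
             "discriminative": float(disc), "verification": ver}
    total = w_rec * rec + w_orth * orth + w_disc * disc + w_ver * ver
    return float(total), parts
```

In a semi-supervised setting, the discriminative and verification terms would presumably be evaluated only on the labeled samples, while the reconstruction and orthogonal terms can use every sample; the sketch assumes a fully labeled batch for simplicity.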
