Speech Emotion Recognition Using Semi-supervised Learning with Ladder Networks

As a major branch of speech processing, speech emotion recognition has drawn considerable attention from researchers. Prior work has proposed a variety of models and feature sets for training such systems. In this paper, we propose semi-supervised learning with ladder networks to generate robust feature representations for speech emotion recognition. In our method, the input to the ladder network is a set of normalized static acoustic features, which are mapped to high-level hidden representations. The model is trained by back-propagation to simultaneously minimize the sum of supervised and unsupervised cost functions. The extracted hidden representations are then used as emotional features in an SVM classifier. Experiments on the IEMOCAP database show that the proposed features yield 2.6% higher performance than a denoising autoencoder and 5.3% higher than the static acoustic features.
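
The following is a minimal sketch, not the authors' implementation, of the training objective described above: a ladder-network-style model whose loss is the sum of a supervised cross-entropy term on labeled utterances and unsupervised denoising (reconstruction) terms, optimized jointly by back-propagation. The layer sizes, noise level, loss weight, and the 384-dimensional input (e.g., a static openSMILE feature vector) are illustrative assumptions, and the decoder here omits the lateral denoising connections of the full ladder network.

```python
# Simplified ladder-network-style training step (PyTorch).
# Assumptions: 384-dim static acoustic features, 4 emotion classes,
# Gaussian corruption noise, and a single scalar weight on the
# unsupervised cost. None of these values come from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleLadder(nn.Module):
    def __init__(self, dims=(384, 256, 128), n_classes=4, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoders = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])
        self.decoders = nn.ModuleList(
            [nn.Linear(dims[i + 1], dims[i]) for i in reversed(range(len(dims) - 1))])
        self.classifier = nn.Linear(dims[-1], n_classes)

    def encode(self, x, noisy):
        # Run the encoder, optionally injecting Gaussian noise at each layer.
        acts = [x + torch.randn_like(x) * self.noise_std if noisy else x]
        for enc in self.encoders:
            h = F.relu(enc(acts[-1]))
            if noisy:
                h = h + torch.randn_like(h) * self.noise_std
            acts.append(h)
        return acts

    def forward(self, x):
        clean = self.encode(x, noisy=False)   # clean path: denoising targets
        noisy = self.encode(x, noisy=True)    # corrupted path
        logits = self.classifier(noisy[-1])   # supervised head
        # Decode the corrupted top-layer code back down, penalizing the
        # mismatch with the corresponding clean activation at each layer.
        recon_cost = 0.0
        z = noisy[-1]
        for dec, target in zip(self.decoders, reversed(clean[:-1])):
            z = dec(z)
            recon_cost = recon_cost + F.mse_loss(z, target)
        return logits, recon_cost

model = SimpleLadder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative step on random stand-in data: labeled examples
# contribute both costs, unlabeled examples only the denoising cost.
x_labeled, y = torch.randn(32, 384), torch.randint(0, 4, (32,))
x_unlabeled = torch.randn(32, 384)

logits, recon_l = model(x_labeled)
_, recon_u = model(x_unlabeled)
loss = F.cross_entropy(logits, y) + 0.1 * (recon_l + recon_u)
opt.zero_grad()
loss.backward()
opt.step()
```

After training, the clean-path top-layer activations (`model.encode(x, noisy=False)[-1]`) would serve as the extracted emotional features for a downstream SVM, e.g. scikit-learn's `sklearn.svm.SVC`.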
