Introducing shared-hidden-layer autoencoders for transfer learning and their application in acoustic emotion recognition

This study addresses a situation that arises in practice, here in acoustic emotion recognition: training and test samples come from different corpora. A model is trained on one database and tested on another, disjoint one. The inherent mismatch between the corpora, and hence between the training and test sets, usually leads to significant performance degradation. To cope with this problem when no training data from the target domain exists, we propose a 'shared-hidden-layer autoencoder' (SHLA) approach that learns common feature representations shared across the training and test sets in order to reduce the discrepancy between them. To demonstrate the effectiveness of our approach, we select the Interspeech Emotion Challenge's FAU Aibo Emotion Corpus as the test database and two other publicly available databases as training sets for extensive evaluation. The experimental results show that our SHLA method significantly improves over the baseline performance and outperforms today's state-of-the-art domain adaptation methods.
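
The shared-hidden-layer idea can be illustrated with a minimal NumPy sketch. The abstract does not give the objective or the training procedure, so everything below is an assumption made for illustration: one encoder shared by both corpora, a separate decoder per corpus, a loss equal to the source reconstruction error plus a hypothetical weight `gamma` times the target reconstruction error, sigmoid units, and plain batch gradient descent. The names `train_shla` and `encode` are hypothetical, and input features are assumed to be scaled to [0, 1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_shla(x_src, x_tgt, n_hidden=8, gamma=1.0, lr=0.5, epochs=200, seed=0):
    """Sketch of a shared-hidden-layer autoencoder (assumed formulation).

    x_src, x_tgt: arrays of shape (n_samples, n_features), values in [0, 1].
    Returns the shared encoder weights, bias, and the per-epoch loss history.
    """
    rng = np.random.default_rng(seed)
    n_in = x_src.shape[1]
    # Encoder parameters, shared by both corpora.
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))
    b = np.zeros(n_hidden)
    # Decoder parameters, one independent set per corpus.
    dec = {name: [rng.normal(0.0, 0.1, (n_hidden, n_in)), np.zeros(n_in)]
           for name in ("src", "tgt")}
    data = {"src": (x_src, 1.0), "tgt": (x_tgt, gamma)}
    history = []
    for _ in range(epochs):
        gW = np.zeros_like(W)
        gb = np.zeros_like(b)
        loss = 0.0
        for name, (x, weight) in data.items():
            V, c = dec[name]
            h = sigmoid(x @ W + b)   # shared hidden representation
            r = sigmoid(h @ V + c)   # corpus-specific reconstruction
            err = r - x
            loss += weight * 0.5 * np.mean(np.sum(err ** 2, axis=1))
            # Backpropagate the weighted mean squared reconstruction error.
            d_out = weight * err * r * (1.0 - r) / len(x)
            d_hid = (d_out @ V.T) * h * (1.0 - h)
            # Encoder gradients accumulate over both corpora.
            gW += x.T @ d_hid
            gb += d_hid.sum(axis=0)
            # Each decoder is updated in place from its own corpus only.
            V -= lr * (h.T @ d_out)
            c -= lr * d_out.sum(axis=0)
        W -= lr * gW
        b -= lr * gb
        history.append(loss)
    return W, b, history

def encode(x, W, b):
    """Map features into the shared hidden space."""
    return sigmoid(x @ W + b)
```

In such a setup, a classifier would be trained on `encode(x_src, W, b)` with the source labels and applied to `encode(x_tgt, W, b)`; because the encoder is fitted to reconstruct both corpora, the hidden features are encouraged to be usable for both. Whether this matches the paper's exact formulation cannot be confirmed from the abstract alone.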
