Transfer Learning for Improving Speech Emotion Classification Accuracy

Most existing speech emotion recognition research focuses on automatic emotion detection with training and testing data drawn from the same corpus and collected under the same conditions. The performance of such systems has been shown to drop significantly in cross-corpus and cross-language scenarios. To address this problem, this paper exploits a transfer learning technique to improve the performance of speech emotion recognition systems in cross-language and cross-corpus scenarios. Evaluations on five corpora in three languages show that Deep Belief Networks (DBNs) offer better cross-corpus emotion recognition accuracy than previous approaches, relative to a baseline system built on a Sparse Autoencoder and a Support Vector Machine (SVM). The results also suggest that training on a larger number of languages, and including a small fraction of the target-corpus data in training, can significantly boost accuracy over the baseline, even for corpora with limited training examples.
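The setup described above can be illustrated with a minimal sketch: unsupervised, DBN-style pre-training (stacked RBMs) over pooled source-language corpora, followed by a discriminative classifier adapted with a small labelled fraction of the target corpus. This is not the paper's implementation; the feature dimensionality, data arrays, and hyperparameters below are hypothetical placeholders, and acoustic feature extraction (e.g., GeMAPS-style functionals) is assumed to have been done beforehand.

```python
# Hedged sketch of cross-corpus transfer: pre-train on pooled source-language
# data, adapt with a small fraction of target-corpus labels, test on the rest.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def build_model():
    # Two stacked RBMs stand in for DBN-style unsupervised pre-training;
    # the linear SVM on top mirrors the discriminative classifier of the baseline.
    return Pipeline([
        ("scale", MinMaxScaler()),  # RBMs expect inputs roughly in [0, 1]
        ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)),
        ("clf", SVC(kernel="linear")),
    ])

# Hypothetical data: acoustic feature vectors and emotion labels from several
# source-language corpora and one unseen target-language corpus.
X_source, y_source = np.random.rand(2000, 88), np.random.randint(0, 4, 2000)
X_target, y_target = np.random.rand(500, 88), np.random.randint(0, 4, 500)

# Use only a small labelled fraction (here 10%) of the target corpus for adaptation.
X_adapt, X_test, y_adapt, y_test = train_test_split(
    X_target, y_target, train_size=0.1, stratify=y_target, random_state=0)

model = build_model()
model.fit(np.vstack([X_source, X_adapt]), np.concatenate([y_source, y_adapt]))
print("cross-corpus accuracy:", model.score(X_test, y_test))
```

In practice, the pre-training stage would typically use unlabelled data from all available corpora, and the fraction of target data included in training would be varied to study its effect on accuracy, as the abstract suggests.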
