Speaker Awareness for Speech Emotion Recognition

Speech emotion recognition (SER) has recently received considerable attention from the research community, largely due to the current machine learning trend. Nevertheless, even the most successful methods still adapt poorly to specific speakers and scenarios, which noticeably reduces their performance compared to humans. In this paper, we evaluate a large-scale machine learning model for the classification of emotional states. This model was trained for speaker identification but is instead used here as a front-end for extracting robust features from emotional speech. We aim to verify that SER improves when speaker-specific emotional prosody cues are taken into account. Experiments with various state-of-the-art classifiers are carried out in the Weka software to evaluate the robustness of the extracted features. Considerable improvement is observed when comparing our results with other state-of-the-art SER techniques.
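To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation: it assumes a hypothetical speaker_embedding() front-end standing in for the pre-trained speaker-identification network, and uses scikit-learn classifiers (SVM, Random Forest) as stand-ins for the Weka classifiers mentioned in the abstract. All names, parameters, and the evaluate() helper are illustrative assumptions.

    # Sketch: pre-trained speaker-ID network used only as a feature extractor,
    # with off-the-shelf classifiers evaluated on top of the embeddings.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier

    def speaker_embedding(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
        """Hypothetical front-end: returns a fixed-length embedding produced by
        a network pre-trained for speaker identification. The actual model and
        its API are not specified by the abstract, so this is a placeholder."""
        raise NotImplementedError("plug in a pre-trained speaker-ID model here")

    def evaluate(utterances, labels):
        # `utterances` is assumed to be a list of (waveform, sample_rate) pairs,
        # `labels` the corresponding emotion categories.
        X = np.stack([speaker_embedding(w, sr) for w, sr in utterances])
        y = np.asarray(labels)
        classifiers = [
            ("SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
            ("Random Forest", RandomForestClassifier(n_estimators=300, random_state=0)),
        ]
        for name, clf in classifiers:
            scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
            print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

The design choice mirrors standard transfer learning: the speaker-identification network is frozen and used purely for feature extraction, so only the lightweight classifiers on top need to be trained on the (typically small) emotional speech corpora.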
