An End-to-End Deep Learning Framework with Speech Emotion Recognition of Atypical Individuals

The goal of the ComParE 2018 Atypical Affect sub-challenge is to recognize the emotional states of atypical individuals. In this work, we present three modeling methods under an end-to-end learning framework: a CNN combined with extended features, a CNN+RNN, and a ResNet. Furthermore, we investigate multiple data augmentation, balancing, and sampling methods to further enhance system performance. The experimental results show that data balancing and augmentation increase the unweighted average recall (UAR) by 10% absolute. After score-level fusion, our proposed system achieves a UAR of 48.8% on the development set.
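
The abstract gives no implementation details for the three models. As a rough illustration of the CNN+RNN branch only, the sketch below stacks a small convolutional front end over log-mel spectrograms and a GRU over the resulting frame sequence; the log-mel input, the GRU choice, every layer size, and the class count are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CnnRnnEmotionNet(nn.Module):
    """Illustrative CNN+RNN emotion classifier (not the authors' exact model).

    Input: log-mel spectrograms of shape (batch, 1, n_mels, frames).
    All layer sizes and the class count are placeholder assumptions.
    """

    def __init__(self, n_mels: int = 64, n_classes: int = 4):
        super().__init__()
        # CNN front end: local time-frequency feature extraction.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # RNN back end: models temporal dynamics of the CNN features.
        self.rnn = nn.GRU(input_size=64 * (n_mels // 4),
                          hidden_size=128, batch_first=True)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(x)                          # (B, 64, n_mels/4, T/4)
        b, c, f, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)
        _, hidden = self.rnn(feats)                  # final GRU hidden state
        return self.classifier(hidden[-1])           # per-class logits
```

A forward pass on a dummy batch, `CnnRnnEmotionNet()(torch.randn(8, 1, 64, 300))`, returns an (8, 4) matrix of per-class scores.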
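
UAR denotes unweighted average recall, i.e. the mean of the per-class recalls (macro-averaged recall), which keeps the majority class from dominating the score on imbalanced data. A minimal computation with scikit-learn, using placeholder labels:

```python
from sklearn.metrics import recall_score

# UAR = mean of per-class recalls; equivalent to macro-averaged recall.
# y_true and y_pred are illustrative placeholder labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")  # (1/2 + 1 + 1/2) / 3 = 0.667
```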
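
The abstract likewise leaves the score-level fusion unspecified; a common baseline is a weighted average of the per-class scores of the individual systems followed by an argmax. The equal default weights in this sketch are an assumption:

```python
import numpy as np

def fuse_scores(score_mats, weights=None):
    """Score-level fusion by weighted averaging (equal weights assumed).

    score_mats: list of (n_utterances, n_classes) score arrays,
                one per system (e.g., the three models above).
    weights:    optional per-system weights; defaults to a uniform average.
    """
    weights = weights or [1.0 / len(score_mats)] * len(score_mats)
    fused = sum(w * s for w, s in zip(weights, score_mats))
    return fused.argmax(axis=1)  # predicted class per utterance
```

Given three (n, c) score matrices, `fuse_scores([s_cnn, s_cnn_rnn, s_resnet])` returns one fused prediction per utterance.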
