Segment Repetition Based on High Amplitude to Enhance a Speech Emotion Recognition

Abstract: Speech Emotion Recognition (SER) is a computer-based technology for realizing Human-Computer Interaction (HCI). It is a challenging task because of the lack of data. Several data augmentation methods have been created to increase data variation, but they do not significantly improve accuracy. Therefore, an additional data augmentation method called Segment Repetition based on High Amplitude (SRHA) is proposed to address this problem. The method repeats the segments of an utterance that have the highest amplitude. An experiment with 10-fold data augmentation, using five standard augmentations plus SRHA and a Long Short-Term Memory (LSTM) network as the classifier, shows that the proposed SRHA significantly increases SER accuracy from 95.88% to 98.16%. Further experiments with 20-fold and 40-fold data augmentation also show that adding SRHA outperforms the five standard augmentations alone. These results indicate that SRHA is a powerful data augmentation method for SER.
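The abstract describes SRHA only as repeating the segments with the highest amplitude, so the following is a minimal sketch of that idea rather than the authors' implementation. The function name `repeat_high_amplitude_segments`, the fixed segment length, the use of peak absolute amplitude as the ranking criterion, and the parameters `n_segments` and `n_repeats` are all assumptions introduced here for illustration.

```python
import numpy as np

def repeat_high_amplitude_segments(signal, segment_len, n_segments=1, n_repeats=1):
    """Return a copy of `signal` in which the loudest segments are repeated.

    Assumptions (not taken from the paper): segments have a fixed length,
    loudness is the peak absolute amplitude of a segment, and each selected
    segment is repeated in place, directly after its original position.
    """
    signal = np.asarray(signal, dtype=float)
    n_full = len(signal) // segment_len
    if n_full == 0:
        # Signal is shorter than one segment: nothing to repeat.
        return signal.copy()

    # Split the signal into non-overlapping, fixed-length segments.
    frames = signal[: n_full * segment_len].reshape(n_full, segment_len)

    # Rank segments by peak absolute amplitude and keep the top n_segments.
    peaks = np.abs(frames).max(axis=1)
    top = set(np.argsort(peaks)[::-1][:n_segments].tolist())

    pieces = []
    for i in range(n_full):
        pieces.append(frames[i])
        if i in top:
            # Insert the extra copies right after the original segment.
            pieces.extend([frames[i]] * n_repeats)

    # Keep any trailing samples that did not fill a whole segment.
    pieces.append(signal[n_full * segment_len:])
    return np.concatenate(pieces)


# Hypothetical usage on a toy waveform with 2-sample segments.
wave = [0.1, 0.2, 0.9, -1.0, 0.3, 0.1, 0.05]
augmented = repeat_high_amplitude_segments(wave, segment_len=2, n_segments=1, n_repeats=2)
```

In this sketch the augmented utterance is longer than the original, so a pipeline that expects fixed-length inputs would need to pad or truncate afterwards. Whether the paper handles length this way is not stated in the abstract.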
