Speech emotion recognition using multichannel parallel convolutional recurrent neural networks based on Gammatone auditory filterbank

Speech emotion recognition (SER) using deep learning methods built on computational models of the human auditory system is a new way to identify emotional states. In this paper, we propose a multichannel parallel convolutional recurrent neural network (MPCRNN) that extracts salient features from the raw waveform through a Gammatone auditory filterbank, and we show that this approach is effective for speech emotion recognition. We first divide the speech signal into segments and decompose each segment into multichannel data with the Gammatone auditory filterbank; this decomposition serves as the front end from which the MPCRNN learns the features most relevant to emotion. The network then outputs an emotion-state probability distribution for each segment. Finally, utterance-level features are constructed from the segment-level probability distributions and fed into a support vector machine (SVM) to identify the emotion. The experimental results show that speech emotion features can be effectively learned with the proposed Gammatone-filterbank-based deep learning approach.
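
The front end described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes SciPy's scipy.signal.gammatone IIR design, ERB-rate spacing of the center frequencies (Glasberg and Moore), and illustrative values for the channel count, frequency range, and segment length.

# Minimal sketch of the segmentation and Gammatone front end. Parameter
# values (segment length, 64 channels, 50-7000 Hz) are assumptions, not
# the authors' settings. Requires SciPy >= 1.6 and fs > 2 * high_hz.
import numpy as np
from scipy.signal import gammatone, lfilter


def erb_space(low_hz, high_hz, n_channels):
    """Center frequencies equally spaced on the ERB-rate scale."""
    def erb(f):
        return 21.4 * np.log10(1.0 + 0.00437 * f)

    def inv_erb(e):
        return (10.0 ** (e / 21.4) - 1.0) / 0.00437

    return inv_erb(np.linspace(erb(low_hz), erb(high_hz), n_channels))


def segment_signal(x, fs, seg_ms=250, hop_ms=125):
    """Split a waveform into fixed-length, half-overlapping segments."""
    seg = int(fs * seg_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    return [x[i:i + seg] for i in range(0, len(x) - seg + 1, hop)]


def gammatone_channels(segment, fs, low_hz=50.0, high_hz=7000.0, n_channels=64):
    """Decompose one segment into (n_channels, n_samples) Gammatone outputs."""
    channels = []
    for fc in erb_space(low_hz, high_hz, n_channels):
        b, a = gammatone(fc, "iir", fs=fs)  # 4th-order IIR Gammatone filter
        channels.append(lfilter(b, a, segment))
    return np.stack(channels)

With these illustrative settings, a 250 ms segment sampled at 16 kHz becomes a 64 x 4000 array, which is the multichannel input the network consumes.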

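The exact MPCRNN topology is not given in this excerpt. The following PyTorch sketch shows one plausible reading of "multichannel parallel convolutional recurrent": a small convolutional branch per group of Gammatone channels, with the branch outputs concatenated, modeled over time by an LSTM, and mapped to a per-segment emotion probability distribution. The layer sizes, branch count, and four-class output are assumptions.

# A sketch of an MPCRNN: parallel per-group Conv1d branches -> concat ->
# LSTM -> softmax over emotion classes. All hyperparameters are assumed
# for illustration, not taken from the paper.
import torch
import torch.nn as nn


class MPCRNN(nn.Module):
    def __init__(self, n_channels=64, n_branches=4, n_classes=4):
        super().__init__()
        assert n_channels % n_branches == 0
        in_ch = n_channels // n_branches
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_ch, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv1d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            )
            for _ in range(n_branches)
        ])
        self.rnn = nn.LSTM(input_size=32 * n_branches, hidden_size=128,
                           batch_first=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):  # x: (batch, n_channels, n_samples)
        groups = torch.chunk(x, len(self.branches), dim=1)
        feats = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        _, (h, _) = self.rnn(feats.transpose(1, 2))    # to (batch, time, feats)
        return torch.softmax(self.out(h[-1]), dim=-1)  # per-segment P(emotion)

Splitting the Gammatone channels into parallel branches lets each frequency region learn its own local filters before the LSTM models temporal dynamics across the segment; a single convolutional stack shared over all channels would be the obvious alternative.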
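In the last stage, the segment-level distributions are pooled into utterance-level features for the SVM. The statistics used here (per-class mean, per-class maximum, and the fraction of segments each class wins) are a common choice assumed for illustration; the excerpt does not specify the authors' exact feature set.

# Sketch of the utterance-level stage: aggregate the segment-level
# emotion probability distributions of an utterance into a fixed-length
# vector and classify it with an SVM. The statistics are an assumed
# illustration, not necessarily the authors' feature set.
import numpy as np
from sklearn.svm import SVC


def utterance_features(seg_probs):
    """seg_probs: (n_segments, n_classes) softmax outputs for one utterance."""
    p = np.asarray(seg_probs)
    wins = np.bincount(p.argmax(axis=1), minlength=p.shape[1]) / len(p)
    return np.concatenate([p.mean(axis=0), p.max(axis=0), wins])


# Hypothetical usage: train_probs is a list of per-utterance probability
# matrices produced by the MPCRNN, train_labels the utterance emotions.
# X = np.stack([utterance_features(p) for p in train_probs])
# clf = SVC(kernel="rbf").fit(X, train_labels)
# pred = clf.predict(utterance_features(test_probs).reshape(1, -1))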