Auditory-Inspired End-to-End Speech Emotion Recognition Using 3D Convolutional Recurrent Neural Networks Based on Spectral-Temporal Representation

The human auditory system recognizes emotion far more accurately than current speech emotion recognition systems, so research has focused on designing recognition systems that mimic it. Psychoacoustic and physiological studies indicate that the human auditory system decomposes speech signals into acoustic and modulation frequency components and extracts temporal modulation cues from them. Emotional states in speech are then perceived from these cues through the spectral-temporal receptive fields of auditory neurons. This paper proposes an end-to-end emotion recognition system based on temporal modulation cues, using three-dimensional convolutional recurrent neural networks (3D-CRNNs). The temporal modulation cues form a four-dimensional spectral-temporal (ST) representation that serves directly as the input to the 3D-CRNN: the convolutional layers extract high-level multiscale ST representations, and the recurrent layer captures long-term temporal dependencies for emotion recognition. The proposed method was evaluated on the IEMOCAP database, and the results show that it exceeds the recognition accuracy of state-of-the-art systems.
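To make the pipeline concrete, the sketch below shows what such a 3D-CRNN could look like in PyTorch: 3D convolutions run jointly over the time, acoustic-frequency, and modulation-frequency axes of the ST representation, the result is flattened into a per-frame feature sequence for an LSTM, and a final linear layer scores the emotion classes. All layer sizes, kernel shapes, input dimensions, and the four-class output (`CRNN3D`, `n_emotions`) are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class CRNN3D(nn.Module):
    """Minimal 3D-CRNN sketch: 3D convolutions over the
    (time, acoustic-frequency, modulation-frequency) axes of the ST
    representation, followed by an LSTM over time and a classifier.
    All sizes are hypothetical, not the authors' configuration."""

    def __init__(self, n_emotions: int = 4):
        super().__init__()
        # Input: (batch, 1, time, acoustic_freq, modulation_freq).
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool only the two frequency axes
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # LSTM consumes one feature vector per time frame:
        # 32 channels * 32 pooled acoustic bins * 8 pooled modulation bins.
        self.lstm = nn.LSTM(input_size=32 * 32 * 8,
                            hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, n_emotions)

    def forward(self, x):
        # x: (batch, 1, T, F_ac, F_mod)
        h = self.conv(x)                    # (batch, C, T, F_ac', F_mod')
        b, c, t, fa, fm = h.shape
        h = h.permute(0, 2, 1, 3, 4).reshape(b, t, c * fa * fm)  # time-major sequence
        out, _ = self.lstm(h)               # long-term temporal dependencies
        return self.fc(out[:, -1])          # classify from the last time step

# Example: 100 time frames, 128 acoustic channels, 32 modulation channels
# (all dimensions hypothetical; the abstract does not fix them).
model = CRNN3D()
logits = model(torch.randn(2, 1, 100, 128, 32))
print(logits.shape)  # torch.Size([2, 4])
```

Note the design choice: pooling only the two frequency axes preserves the frame rate, so the LSTM still sees one feature vector per original time frame.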
