Attention-guided 3D CNN-LSTM model for accurate speech-based emotion recognition

Abstract In this paper, a novel approach based on an attention-guided 3D convolutional neural network (CNN)-long short-term memory (LSTM) model is proposed for speech-based emotion recognition. The proposed attention-guided 3D CNN-LSTM model is trained in an end-to-end fashion. The input speech signals are initially resampled and pre-processed to remove noise and emphasize the high frequencies. Then, spectrogram, Mel-frequency cepstral coefficient (MFCC), cochleagram and fractal dimension methods are used to convert the input speech signals into speech images. The obtained images are concatenated into four-dimensional volumes and used as input to the developed 28-layer attention-integrated 3D CNN-LSTM model. The 3D CNN-LSTM model comprises six 3D convolutional layers, two batch normalization (BN) layers, five Rectified Linear Unit (ReLU) layers, three 3D max pooling layers, one attention layer, one LSTM layer, one flatten layer, one dropout layer and two fully connected layers. The attention layer is connected to the 3D convolutional layers. Three datasets, namely the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), RML and SAVEE, are used in the experimental work; a mixture of these datasets is used as well. Classification accuracy, sensitivity, specificity and F1-score are used to evaluate the developed method. The obtained results are compared with recently published results, showing that the proposed method outperforms the compared methods.
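The paper does not publish code, but the input-assembly step described above — resizing the four speech-image representations (spectrogram, MFCC, cochleagram, fractal dimension map) to a common size and stacking them into a four-dimensional volume for the 3D CNN — can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function name, the target size, nearest-neighbour resizing, and the (depth, height, width, channel) layout are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def stack_feature_volume(spectrogram, mfcc, cochleagram, fractal_map, size=(64, 64)):
    """Resize four 2D speech-image representations to a common shape and
    stack them into one 4D volume of shape (4, H, W, 1) for a 3D CNN.

    The four inputs may have different shapes (e.g. an MFCC matrix is much
    shorter along the coefficient axis than a spectrogram); each is resized
    independently before stacking.
    """
    def resize_nearest(img, size):
        # Simple nearest-neighbour resize via integer index sampling.
        h, w = img.shape
        rows = np.arange(size[0]) * h // size[0]
        cols = np.arange(size[1]) * w // size[1]
        return img[np.ix_(rows, cols)]

    maps = [resize_nearest(np.asarray(m, dtype=np.float32), size)
            for m in (spectrogram, mfcc, cochleagram, fractal_map)]
    # Stack along a new depth axis, then add a trailing channel axis:
    # result shape is (4, size[0], size[1], 1).
    return np.stack(maps, axis=0)[..., np.newaxis]
```

In practice the four maps would come from an audio front-end (e.g. librosa's spectrogram and MFCC routines); random arrays of arbitrary sizes are enough to exercise the shape logic.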
