Speech Emotion Recognition With Early Visual Cross-modal Enhancement Using Spiking Neural Networks

Speech emotion recognition (SER) is an important part of affective computing and signal processing research. A number of approaches, especially deep learning techniques, have achieved promising results on SER. However, challenges remain in capturing the temporal and dynamic changes in emotion conveyed through speech. Spiking Neural Networks (SNNs) have been demonstrated to be a promising approach in machine learning and pattern recognition tasks such as handwriting and facial expression recognition. In this paper, we investigate the use of SNNs for SER tasks and, more importantly, propose a new cross-modal enhancement approach. This method is inspired by auditory information processing in the brain, where, in multisensory audio-visual processing, auditory information is preceded, enhanced and predicted by visual processing. We have conducted experiments on two datasets to compare our approach with state-of-the-art SER techniques in both uni-modal and multi-modal settings. The results demonstrate that SNNs can be an ideal candidate for modeling temporal relationships in speech features and that our cross-modal approach can significantly improve the accuracy of SER.
