A multimodal hierarchical approach to speech emotion recognition from audio and text

Abstract Speech emotion recognition (SER) plays a crucial role in improving the quality of man–machine interfaces in fields such as distance learning, medical science, virtual assistants, and automated customer service. In this work, a deep learning-based hierarchical approach is proposed for both unimodal and multimodal SER systems. The audio-based unimodal system uses a combination of 33 audio features covering prosody, spectral, and voice quality characteristics. The multimodal system combines these audio features with additional textual features: Embeddings from Language Models v2 (ELMo v2) is used to extract word and character embeddings, which help capture the context-dependent aspects of emotion in text. The proposed models are evaluated on two audio-only unimodal datasets, SAVEE and RAVDESS, and one audio-text multimodal dataset, IEMOCAP. The proposed hierarchical models achieve SER accuracies of 81.2%, 81.7%, and 74.5% on the RAVDESS, SAVEE, and IEMOCAP datasets, respectively. These results are benchmarked against recently reported techniques and are found to be superior. Based on the presented investigations, it is concluded that applying a deep learning-based network in a hierarchical manner significantly improves SER over generic unimodal and multimodal systems.

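As a concrete illustration of the audio front end described above, the sketch below extracts a small set of prosody, spectral, and voice-quality style descriptors from a single utterance. This is a minimal sketch only: the exact 33-feature set used by the proposed system is not enumerated in the abstract, so the specific features, the librosa toolkit, and the summary statistics chosen here are illustrative assumptions rather than the authors' pipeline.

```python
# Minimal sketch (not the authors' exact pipeline) of extracting prosody,
# spectral, and voice-quality style descriptors from one utterance with librosa.
import numpy as np
import librosa

def extract_audio_features(wav_path, sr=16000):
    """Return a fixed-length utterance-level feature vector (illustrative)."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Prosody: fundamental frequency (F0) track and frame-level energy.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    if f0.size == 0:
        f0 = np.zeros(1)
    rms = librosa.feature.rms(y=y)[0]

    # Spectral: 13 MFCCs and the spectral centroid, summarized over time.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]

    # Voice-quality proxy: zero-crossing rate (jitter/shimmer would require
    # a tool such as Praat/Parselmouth and are omitted in this sketch).
    zcr = librosa.feature.zero_crossing_rate(y)[0]

    # Summarize each descriptor over time into one utterance-level vector.
    return np.concatenate([
        [f0.mean(), f0.std(), rms.mean(), rms.std()],
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [centroid.mean(), zcr.mean()],
    ])
```

For the text branch, ELMo v2 word and character embeddings would be obtained analogously from the utterance transcripts (e.g., via a pretrained ELMo module) before the audio and text representations are passed to the hierarchical classifier.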