A Comprehensive Review of Speech Emotion Recognition Systems

During the last decade, Speech Emotion Recognition (SER) has emerged as an integral component of Human-Computer Interaction (HCI) and other high-end speech processing systems. Generally, an SER system detects a speaker’s varied emotions by extracting and classifying the prominent features from a preprocessed speech signal. However, the ways in which humans and machines recognize and correlate the emotional aspects of speech signals differ both quantitatively and qualitatively, which presents enormous difficulties in blending knowledge from interdisciplinary fields, particularly speech emotion recognition, applied psychology, and human-computer interface. This paper identifies and synthesizes recent literature on the varied design components and methodologies of SER systems, thereby providing readers with a state-of-the-art understanding of this active research topic. Furthermore, while scrutinizing the current state of understanding of SER systems, the paper outlines the prominent research gaps for consideration and analysis by other researchers, institutions, and regulatory bodies.
