Deep Learning Techniques for Speech Emotion Recognition, from Databases to Models

Advances in neural networks and the growing demand for accurate, near real-time Speech Emotion Recognition (SER) in human–computer interaction make it necessary to compare the available SER methods and databases, both to reach feasible solutions and to gain a firmer understanding of this open-ended problem. This study reviews deep learning approaches to SER together with the available datasets, and then covers conventional machine learning techniques for SER. Finally, we present a multi-aspect comparison of practical neural network approaches to SER. The goal of the study is to provide a survey of the field of discrete speech emotion recognition.
