Paper Information - Deep Learning Techniques for Speech Emotion Recognition, from Databases to Models

Advances in neural networks and the growing demand for accurate, near real-time Speech Emotion Recognition (SER) in human–computer interaction make it necessary to compare the available SER methods and databases, both to reach feasible solutions and to gain a firmer understanding of this open-ended problem. This study reviews deep learning approaches to SER together with the available datasets, and then covers conventional machine learning techniques for SER. Finally, we present a multi-aspect comparison of practical neural network approaches to SER. The goal of this study is to provide a survey of the field of discrete speech emotion recognition.

Daniel Sierra-Sosa | Adel Said Elmaghraby | Babak Joze Abbaschian

[41] M. Kathleen Pichora-Fuller,et al. Recognition of emotional speech for younger and older talkers: Behavioural findings from the toronto emotional speech set , 2011 .

[42] peng song,et al. Transfer Linear Subspace Learning for Cross-Corpus Speech Emotion Recognition , 2019, IEEE Transactions on Affective Computing.

[43] Wendi B. Heinzelman,et al. Unsupervised Learning Approach to Feature Analysis for Automatic Speech Emotion Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[44] Adel Said Elmaghraby,et al. Emotion analysis from speech using temporal contextual trajectories , 2014, 2014 IEEE Symposium on Computers and Communications (ISCC).

[45] Shrikanth S. Narayanan,et al. The Vera am Mittag German audio-visual emotional speech database , 2008, 2008 IEEE International Conference on Multimedia and Expo.

[46] Rajib Maity,et al. Hybrid Deep Learning Approach for Multi-Step-Ahead Daily Rainfall Prediction Using GCM Simulations , 2020, IEEE Access.

[47] Narendra Ahuja,et al. Cresceptron: a self-organizing neural network which grows adaptively , 1992, [Proceedings 1992] IJCNN International Joint Conference on Neural Networks.

[48] Erchin Serpedin,et al. Deep Learning Detection of Electricity Theft Cyber-Attacks in Renewable Distributed Generation , 2020, IEEE Transactions on Smart Grid.

[49] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[50] Jürgen Schmidhuber,et al. LSTM: A Search Space Odyssey , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[51] Peng Song,et al. Speech Emotion Recognition Using Transfer Learning , 2014, IEICE Trans. Inf. Syst..

[52] Wen Gao,et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching , 2018, IEEE Transactions on Multimedia.

[53] Noam Amir,et al. Classifying emotions in speech: a comparison of methods , 2001, INTERSPEECH.

[54] Shrikanth Narayanan,et al. Data Augmentation Using GANs for Speech Emotion Recognition , 2019, INTERSPEECH.

[55] Robert I. Damper,et al. Multi-class and hierarchical SVMs for emotion recognition , 2010, INTERSPEECH.

[56] Valery A. Petrushin,et al. EMOTION IN SPEECH: RECOGNITION AND APPLICATION TO CALL CENTERS , 1999 .

[57] John H. L. Hansen,et al. Sentiment extraction from natural audio streams , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[58] Erik Cambria,et al. SenticNet 6: Ensemble Application of Symbolic and Subsymbolic AI for Sentiment Analysis , 2020, CIKM.

[59] Ragini Verma,et al. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset , 2014, IEEE Transactions on Affective Computing.

[60] Kaya Oguz,et al. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers , 2020, Speech Commun..

[61] K. Stevens,et al. Emotions and speech: some acoustical correlates. , 1972, The Journal of the Acoustical Society of America.

[62] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[63] Pascale Fung,et al. A first look into a Convolutional Neural Network for speech emotion detection , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[64] Sepp Hochreiter,et al. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) , 2015, ICLR.

[65] Adel Said Elmaghraby,et al. Speech emotion detection using time dependent self organizing maps , 2013, IEEE International Symposium on Signal Processing and Information Technology.

[66] Chia-Ping Chen,et al. Effective Attention Mechanism in Dynamic Models for Speech Emotion Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[67] David Philippou-Hübner,et al. Determining optimal features for emotion recognition from speech by applying an evolutionary algorithm , 2010, INTERSPEECH.

[68] Yihai He,et al. Root cause analysis approach based on reverse cascading decomposition in QFD and fuzzy weight ARM for quality accidents , 2020, Comput. Ind. Eng..

[69] Astrid Paeschke,et al. A database of German emotional speech , 2005, INTERSPEECH.

[70] Rajib Rana,et al. Variational Autoencoders for Learning Latent Representations of Speech Emotion , 2017, INTERSPEECH.

[71] Gwenn Englebienne,et al. Towards Speech Emotion Recognition "in the Wild" Using Aggregated Corpora and Deep Multi-Task Learning , 2017, INTERSPEECH.

[72] Björn W. Schuller,et al. Timing levels in segment-based speech emotion recognition , 2006, INTERSPEECH.

[73] Björn W. Schuller,et al. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[74] Simone D. J. Barbosa,et al. Introduction to Human-Computer Interaction , 2018, CHI Extended Abstracts.

[75] Mustaqeem,et al. Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features , 2020, Sensors.

[76] Ryohei Nakatsu,et al. Emotion recognition and its application to computer agents with spontaneous interactive capabilities , 1999, MULTIMEDIA '99.

[77] G. Bansal,et al. A Review on Emotion Detection and Classification using Speech , 2020 .

[78] Ruili Wang,et al. Ensemble methods for spoken emotion recognition in call-centres , 2007, Speech Commun..

[79] Lawrence R. Rabiner,et al. A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[80] Thamer Alhussain,et al. Speech Emotion Recognition Using Deep Learning Techniques: A Review , 2019, IEEE Access.

[81] Yihai He,et al. Big data oriented root cause identification approach based on Axiomatic domain mapping and weighted association rule mining for product infant failure , 2017, Comput. Ind. Eng..

[82] Björn W. Schuller,et al. Deep neural networks for acoustic emotion recognition: Raising the benchmarks , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[83] D. Hubel,et al. Receptive fields and functional architecture of monkey striate cortex , 1968, The Journal of physiology.

[84] Aurobinda Routray,et al. Databases, features and classifiers for speech emotion recognition: a review , 2018, International Journal of Speech Technology.

[85] Rajib Rana,et al. Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness , 2018, ArXiv.

[86] Tatsuya Kawahara,et al. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning , 2019, INTERSPEECH.

[87] Stefan Steidl,et al. Automatic classification of emotion related user states in spontaneous children's speech , 2009 .

[88] Sakorn Mekruksavanich,et al. Negative Emotion Recognition using Deep Learning for Thai Language , 2020, 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON).