Automatic Speech Emotion Recognition: A survey

Automatic Speech Emotion Recognition (ASER) has attracted considerable research interest. A typical ASER framework comprises three steps: speech feature extraction, dimensionality reduction, and feature classification. Underlying this framework is the design and recording of databases of emotional speech, from which the most commonly studied emotions have been obtained: happiness, sadness, anger, fear, disgust, and boredom (often referred to as the 'archetypal emotions'), along with neutral and others. This paper surveys the work done in this field, with particular emphasis on the three steps of the ASER framework. Beginning with the languages explored to date for creating such databases, it categorizes the features that are typically extracted, lists the dimensionality reduction techniques that have been applied, and discusses the pros and cons, where applicable, of the feature classifiers that have been modelled.
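
The sketch below illustrates the three-step pipeline in its simplest form. It is only an assumption-laden example, not a method prescribed by the survey: MFCC statistics stand in for the extracted features, PCA for dimensionality reduction, and an SVM for classification, using librosa and scikit-learn; the toy random signals and labels are hypothetical placeholders for a labelled emotional-speech database.

```python
# Minimal sketch of an ASER pipeline: feature extraction -> dimensionality
# reduction -> classification. The specific features (MFCCs), reducer (PCA)
# and classifier (SVM) are illustrative choices, not the survey's own method.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(signal, sr=16000, n_mfcc=13):
    """Step 1: extract MFCCs and summarise them into a fixed-length vector."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Mean and standard deviation over time collapse the frame sequence
    # into one utterance-level feature vector.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical toy data: random signals standing in for labelled utterances.
rng = np.random.default_rng(0)
signals = [rng.standard_normal(16000).astype(np.float32) for _ in range(20)]
labels = rng.integers(0, 4, size=20)  # e.g. happy / sad / angry / neutral

X = np.vstack([utterance_features(s) for s in signals])

# Steps 2 and 3: PCA for dimensionality reduction, SVM for classification.
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
model.fit(X, labels)
print(model.predict(X[:5]))
```

In practice the feature set, the reduction technique, and the classifier are exactly the design choices the survey compares across the literature.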
