Speech emotion recognition: Features and classification models

To address speaker-independent emotion recognition, a three-level speech emotion recognition model is proposed that classifies six emotions, namely sadness, anger, surprise, fear, happiness, and disgust, from coarse to fine. At each level, appropriate features are selected from 288 candidates using the Fisher rate, and the selected features serve as the input to a Support Vector Machine (SVM). To evaluate the proposed system, principal component analysis (PCA) for dimension reduction and an artificial neural network (ANN) for classification are used to design four comparative experiments: Fisher+SVM, PCA+SVM, Fisher+ANN, and PCA+ANN. The experimental results show that Fisher outperforms PCA for dimension reduction, and that the SVM generalizes better than the ANN for speaker-independent speech emotion recognition. The average recognition rates for the three levels are 86.5%, 68.5%, and 50.2%, respectively.
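The Fisher-rate selection step described above could be sketched as follows. This is a minimal NumPy illustration of scoring each feature by its between-class to within-class variance ratio and keeping the top-scoring ones; the `fisher_ratio` helper and the toy data are illustrative, not the authors' actual 288-feature, three-level pipeline.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher ratio: between-class variance over within-class variance.

    X: (n_samples, n_features) feature matrix; y: (n_samples,) class labels.
    Higher values mean the feature separates the classes better.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        between += len(Xc) * (mu_c - overall_mean) ** 2
        within += ((Xc - mu_c) ** 2).sum(axis=0)
    # Guard against zero within-class variance.
    return between / np.maximum(within, 1e-12)

# Toy data: feature 0 separates the two classes, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)),
               rng.normal([5, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

scores = fisher_ratio(X, y)
selected = np.argsort(scores)[::-1][:1]  # indices of the k best features
```

The selected columns `X[:, selected]` would then be fed to an SVM classifier (e.g. an RBF-kernel SVM) at each level of the hierarchy.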
