For emotion recognition, we selected pitch, log energy, formants, mel-band energies, and mel-frequency cepstral coefficients (MFCCs) as the base features, and added the velocity and acceleration of pitch and of the MFCCs to form feature streams. Treating each stream as a one-dimensional signal, we extracted statistics from it for use with discriminative classifiers. The extracted features were analyzed using quadratic discriminant analysis (QDA) and a support vector machine (SVM); the experimental results showed that pitch and energy were the most important factors. Using two different databases, we compared the emotion recognition performance of several classifiers: SVM, linear discriminant analysis (LDA), QDA, and hidden Markov models (HMMs). On the text-independent SUSAS database, a Gaussian-kernel SVM achieved the best accuracies, 96.3% for stressed/neutral style classification and 70.1% for 4-class speaking-style classification, surpassing previously reported results. On the speaker-independent AIBO database, we achieved 42.3% accuracy for 5-class emotion recognition.
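As a concrete illustration of this pipeline, the sketch below extracts MFCC streams with their velocity/acceleration plus a log-energy stream, treats each stream as a one-dimensional signal summarized by simple statistics, and trains a Gaussian (RBF-kernel) SVM. It is a minimal sketch assuming librosa and scikit-learn; the file names, labels, chosen statistics, and parameter values are illustrative rather than the paper's exact configuration, and the pitch, formant, and mel-band-energy streams are omitted for brevity.

```python
# Minimal sketch of utterance-level feature statistics + Gaussian SVM.
# Assumes librosa and scikit-learn; paths/labels below are placeholders.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path, n_mfcc=13):
    """Summarize each 1-D feature stream of one utterance with statistics."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # base MFCC streams
    streams = [
        mfcc,
        librosa.feature.delta(mfcc),             # velocity of MFCCs
        librosa.feature.delta(mfcc, order=2),    # acceleration of MFCCs
        np.log(librosa.feature.rms(y=y) + 1e-10),  # log-energy stream
    ]
    feats = np.vstack(streams)                   # shape: (n_streams, n_frames)
    # Per-stream statistics (illustrative choice: mean, std, min, max).
    stats = [feats.mean(axis=1), feats.std(axis=1),
             feats.min(axis=1), feats.max(axis=1)]
    return np.concatenate(stats)

# Hypothetical training data: utterance paths and speaking-style labels.
paths = ["stressed_001.wav", "neutral_001.wav"]
labels = ["stressed", "neutral"]

X = np.array([utterance_features(p) for p in paths])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # Gaussian SVM
clf.fit(X, labels)
```

Standardizing the statistics before the SVM matters here because the streams (e.g., log energy vs. higher-order MFCCs) have very different dynamic ranges, which would otherwise distort the Gaussian kernel's distance computation.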