Recognizing emotions for audio-visual document indexing

In this paper, we propose using mel-frequency cepstral coefficients (MFCCs) together with a simple but efficient classification method, vector quantization (VQ), to perform speaker-dependent emotion recognition. Several other features (energy, pitch, zero-crossing rate, phonetic rate, LPC coefficients) and their derivatives are also tested and combined with MFCCs in order to find the best combination. Other models, Gaussian mixture models (GMMs) and discrete and continuous hidden Markov models (HMMs), are studied as well, in the hope that continuous distributions and the temporal evolution of the feature set will improve the quality of emotion recognition. The accuracy in recognizing five different emotions exceeds 80% using MFCCs alone with the VQ model. This simple but efficient approach even outperforms human evaluation on the same database, in which listeners judged sentences without the possibility of replaying them or comparing between sentences (Engberg and Hansen, 2001).
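
To make the MFCC + VQ approach concrete, the sketch below shows one plausible realization in Python: a per-emotion codebook is trained with k-means over MFCC frames, and a test utterance is assigned the emotion whose codebook yields the lowest average quantization distortion. This is a minimal sketch, not the authors' implementation; the libraries (librosa, SciPy), function names, and parameters (13 MFCCs, 64-entry codebooks) are assumptions chosen for illustration.

```python
import numpy as np
import librosa
from scipy.cluster.vq import kmeans, vq

def extract_mfcc(path, n_mfcc=13):
    """Load an utterance and return its MFCC frames (n_frames x n_mfcc).
    n_mfcc=13 is a common default, not a value taken from the paper."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one feature vector per frame

def train_codebooks(train_files, codebook_size=64):
    """Train one VQ codebook per emotion from pooled MFCC frames.

    train_files: dict mapping emotion label -> list of wav paths,
    all from the same speaker (the task is speaker-dependent).
    """
    codebooks = {}
    for emotion, paths in train_files.items():
        frames = np.vstack([extract_mfcc(p) for p in paths])
        # k-means clustering of the frames gives the VQ codebook
        codebook, _ = kmeans(frames.astype(float), codebook_size)
        codebooks[emotion] = codebook
    return codebooks

def classify(path, codebooks):
    """Assign the emotion whose codebook quantizes the utterance's
    MFCC frames with the lowest average distortion."""
    frames = extract_mfcc(path).astype(float)
    distortions = {}
    for emotion, codebook in codebooks.items():
        _, dists = vq(frames, codebook)  # per-frame distances to nearest codeword
        distortions[emotion] = dists.mean()
    return min(distortions, key=distortions.get)
```

In use, train_codebooks would receive a dict such as {'anger': [...], 'joy': [...]} of wav files from one speaker, and classify would return the predicted label for a held-out utterance; the per-emotion codebook with minimum-distortion decision is the standard way VQ is applied as a classifier.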