Time-Frequency Feature Representation Using Multi-Resolution Texture Analysis and Acoustic Activity Detector for Real-Life Speech Emotion Recognition

The classification of emotional speech is widely studied in speech-related research on human-computer interaction (HCI). This paper presents a novel feature extraction method based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis of the speech spectrogram for the characterization and classification of different emotions in a speech signal. The motivation is that emotions carry different intensities in different frequency bands. From the standpoint of human visual perception, the texture properties of the emotional speech spectrogram at multiple resolutions should form a good feature set for emotion classification in speech; moreover, multi-resolution texture analysis discriminates between emotions more clearly than uniform-resolution texture analysis. To achieve high accuracy of emotional discrimination, especially in real-life conditions, an acoustic activity detection (AAD) algorithm is incorporated into the MRTII-based feature extraction. Because real-life speech contains many blended emotions, this paper makes use of two corpora of naturally occurring dialogs recorded in real-life call centers. Compared with traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and state-of-the-art features, the MRTII features also improve the correct classification rates of the proposed systems across databases in different languages. Experimental results show that the proposed MRTII features, inspired by human visual perception of the spectrogram image, provide significant classification performance for real-life speech emotion recognition.
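The core idea of treating the spectrogram as a texture image and extracting statistics at several resolutions can be sketched as follows. This is a minimal illustrative sketch, not the authors' MRTII implementation: the Hann-windowed STFT, the Laws-style edge filter, the 2x2 average-pooling pyramid, and all function names here are assumptions chosen for clarity.

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude STFT computed with numpy only (Hann window, assumed parameters)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq, time)

def texture_energy(img):
    """One simple texture statistic: mean absolute response to a
    Laws-style edge filter applied along both image axes (an assumption;
    the paper's actual texture measures may differ)."""
    k = np.array([-1.0, 0.0, 1.0])
    horiz = np.apply_along_axis(np.convolve, 1, img, k, 'same')
    vert = np.apply_along_axis(np.convolve, 0, img, k, 'same')
    return float(np.mean(np.abs(horiz)) + np.mean(np.abs(vert)))

def multiresolution_texture(img, levels=3):
    """Texture energy at each level of a pyramid built by 2x2 average pooling,
    yielding one feature per resolution."""
    feats = []
    for _ in range(levels):
        feats.append(texture_energy(img))
        h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
        img = img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return feats

# Toy "utterance": two tones, so the spectrogram has visible structure.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
S = np.log1p(spectrogram(x))        # log-magnitude spectrogram as a texture image
feats = multiresolution_texture(S)  # one texture value per resolution level
```

In the paper, such multi-resolution texture statistics are computed after acoustic activity detection has isolated the speech regions, and the resulting feature vector is fed to a classifier; the pooling pyramid above merely stands in for whichever multi-resolution decomposition (e.g. wavelet-based) the authors use.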
