Vocal Emotion Recognition with Log-Gabor Filters

Vocal emotion recognition aims to identify the emotional states of speakers by analyzing their speech signals. This paper builds on the work of Ezzat, Bouvrie, and Poggio by performing a spectro-temporal analysis of affective vocalizations in which the associated spectrograms are decomposed with 2D Gabor filters. Based on previous studies of emotion expression in voices and of how emotions manifest themselves in spectrograms, we hypothesized that each vocal emotion has a unique spectro-temporal signature in terms of oriented energy bands, which can be detected by properly tuned Gabor filters. We compared the emotion-recognition performance of tuned log-Gabor filters with that of standard acoustic features. The experimental results show that applying pairs of log-Gabor filters to extract features from the spectrogram yields a recognition performance that matches that of an approach based on traditional acoustic features. Combining both feature types outperforms state-of-the-art vocal emotion recognition algorithms. This leads us to conclude that tuned log-Gabor filters support the automatic recognition of emotions from speech and may be beneficial to other speech-related tasks.
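
To make the filtering step concrete, the following is a minimal sketch (not the authors' code) of the log-Gabor approach: a 2D log-Gabor transfer function (after Field, 1987) is built in the Fourier domain and applied to a spectrogram to measure the energy along one spectro-temporal orientation. The function names and the parameter values (f0, theta0, sigma_f, sigma_theta) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def log_gabor_2d(shape, f0=0.1, theta0=0.0, sigma_f=0.55, sigma_theta=0.4):
    """2D log-Gabor transfer function on a grid of normalized frequencies:
    a log-Gaussian radial term times a Gaussian angular term."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]   # frequency-axis (vertical) freqs
    fx = np.fft.fftfreq(cols)[None, :]   # time-axis (horizontal) freqs
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0                   # avoid log(0) at DC; zeroed below
    theta = np.arctan2(fy, fx)
    radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_f) ** 2))
    radial[0, 0] = 0.0                   # a log-Gabor filter has no DC component
    dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular

def filter_energy(spectrogram, **kwargs):
    """Filter a (freq x time) spectrogram in the Fourier domain and
    return the total response energy as a scalar feature."""
    G = log_gabor_2d(spectrogram.shape, **kwargs)
    response = np.fft.ifft2(np.fft.fft2(spectrogram) * G)
    return float(np.sum(np.abs(response) ** 2))

# Example: a pair of filters tuned to two orientations yields a
# two-dimensional feature vector for one utterance's spectrogram.
S = np.abs(np.random.randn(128, 256))    # stand-in for a real spectrogram
features = [filter_energy(S, theta0=t) for t in (0.0, np.pi / 4)]
```

In this reading, each tuned filter (or pair of filters) contributes one energy feature, and the resulting feature vector is fed to a conventional classifier; the classifier itself is outside the scope of this sketch.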

[1] Jody Kreiman et al., Foundations of Voice Studies: An Interdisciplinary Approach to Voice Production and Perception, 2011.

[2] Zied Lachiri et al., Environmental Sounds Classification Based on Visual Features, CIARP, 2011.

[3] Zhaohui Wu et al., MASC: A Speech Corpus in Mandarin for Emotion Analysis and Affective Speaker Recognition, IEEE Odyssey: The Speaker and Language Recognition Workshop, 2006.

[4] Tony Ezzat et al., Spectro-temporal analysis of speech using 2-D Gabor filters, INTERSPEECH, 2007.

[5] Klaus R. Scherer et al., The role of intonation in emotional expressions, Speech Communication, 2005.

[6] Eric O. Postma et al., Dimensionality Reduction: A Comparative Review, 2008.

[7] Powen Ru et al., Multiresolution spectrotemporal analysis of complex sounds, The Journal of the Acoustical Society of America, 2005.

[8] Eric O. Postma et al., The log-Gabor method: speech classification using spectrogram image analysis, INTERSPEECH, 2012.

[9] Lijiang Chen et al., Speech emotion recognition: Features and classification models, Digital Signal Processing, 2012.

[10] D. Gabor, Theory of communication. Part 1: The analysis of information, 1946.

[11] Fabien Ringeval et al., Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions, 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013.

[12] Volker Hohmann et al., Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency, Speech Communication, 2011.

[13] G. Victo Sudha George et al., Review on Feature Selection Techniques and the Impact of SVM for Cancer Classification using Gene Expression Profile, arXiv, 2011.

[14] K. Hammerschmidt et al., Acoustical correlates of affective prosody, Journal of Voice, 2007.

[15] Fabio Valente et al., The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism, INTERSPEECH, 2013.

[16] Honglak Lee et al., Unsupervised feature learning for audio classification using convolutional deep belief networks, NIPS, 2009.

[17] Birger Kollmeier et al., Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition, Speech Communication, 2011.

[18] D. J. Field, Relations between the statistics of natural images and the response properties of cortical cells, Journal of the Optical Society of America A, 1987.

[19] Fabien Ringeval et al., AV+EC 2015: The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data, AVEC @ ACM Multimedia, 2015.