Emotion recognition from speech under environmental noise conditions using wavelet decomposition

Automatic emotion recognition from speech signals has attracted the attention of the research community in recent years. One of the main challenges is to find suitable features to represent the affective state of the speaker. In this paper, a new set of features derived from the wavelet packet transform is proposed to classify negative emotions such as anger, fear, and disgust, and to differentiate them from the neutral state and from positive emotions such as happiness. Different wavelet decompositions are considered for both voiced and unvoiced segments in order to determine the frequency bands where the emotional information is concentrated. Several measures are calculated on the wavelet-decomposed signals, including log-energy, entropy measures, mel-frequency cepstral coefficients, and the Lempel-Ziv complexity. The experiments consider two databases extensively used in emotion recognition: the Berlin emotional speech database and the eNTERFACE'05 database. In order to approximate real-world recording conditions, both databases are degraded with different environmental noises, such as cafeteria babble and street noise, added at several signal-to-noise ratio (SNR) levels ranging from -3 to 6 dB. Finally, the effect of two different speech enhancement methods is evaluated. According to the results, the features calculated from the lower-frequency wavelet decomposition coefficients are able to recognize fear-type emotions in speech. Also, one of the speech enhancement algorithms proves useful for improving the accuracy on speech signals affected by high levels of background noise.
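As an illustration of the two processing steps described above (degrading speech at a controlled SNR and extracting per-band measures from a wavelet packet decomposition), the following Python sketch uses the PyWavelets library with an assumed Daubechies-4 wavelet and a 4-level decomposition; the wavelet, decomposition depth, and framing are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
import pywt  # PyWavelets


def add_noise_at_snr(clean, noise, snr_db):
    """Mix a noise recording into clean speech at a target SNR (in dB).

    Assumes both signals are 1-D float arrays at the same sampling rate.
    """
    noise = np.resize(noise, clean.shape)          # loop/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_clean / p_scaled_noise) == snr_db
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise


def wavelet_packet_features(frame, wavelet="db4", level=4):
    """Log-energy and Shannon entropy per wavelet packet band of one frame."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    feats = []
    for node in wp.get_level(level, order="freq"):  # frequency-ordered bands
        coeffs = node.data
        energy = np.sum(coeffs ** 2) + 1e-12
        p = coeffs ** 2 / energy                    # normalized coefficient energies
        entropy = -np.sum(p * np.log2(p + 1e-12))
        feats.extend([np.log(energy), entropy])
    return np.array(feats)


# Example: degrade a frame at 0 dB SNR and extract 2 features per band (32 total at level 4)
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech_frame = rng.standard_normal(1024)        # stand-in for a voiced speech frame
    babble_noise = rng.standard_normal(4096)        # stand-in for cafeteria babble
    noisy = add_noise_at_snr(speech_frame, babble_noise, snr_db=0)
    print(wavelet_packet_features(noisy).shape)
```

In the paper's pipeline, features such as these (together with mel-frequency cepstral coefficients and Lempel-Ziv complexity) would be computed separately on voiced and unvoiced segments before classification.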
