Effects of band reduction and coding on speech emotion recognition

The majority of speech emotion recognition (SER) results refer to full-band, uncompressed speech signals. Potential applications of SER on various speech platforms raise important questions about how the bandwidth limitations and compression techniques used by speech communication systems affect SER accuracy. The current study addresses these questions through SER experiments with band-limited as well as compressed speech. The compression techniques included the AMR, AMR-WB, AMR-WB+ and MP3 methods. Speech emotions were modelled and classified using a benchmark approach based on a GMM classifier with speech features including MFCCs, TEO-based parameters, and glottal time- and frequency-domain parameters. The tests used the Berlin Emotional Speech database, with signals sampled at 16 kHz. The results indicated that both the low-frequency components (0–1 kHz) and the high-frequency components (above 4 kHz) of speech play an important role in SER. MP3 compression worked better with the MFCC features than with the TEO and glottal parameters, and AMR-WB and AMR-WB+ outperformed AMR.
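The benchmark classification scheme mentioned above — one Gaussian mixture model per emotion, with an utterance assigned to the model that scores its feature frames most highly — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-frame feature vectors are synthetic stand-ins (real systems would extract MFCC, TEO, or glottal features from the speech first), and the emotion labels and GMM settings are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical per-frame feature vectors (e.g., 13 MFCCs per frame).
# In a real SER system these would come from a feature extractor run
# on the training utterances of each emotion class.
train = {
    "angry":   rng.normal(loc=2.0, scale=1.0, size=(500, 13)),
    "neutral": rng.normal(loc=-2.0, scale=1.0, size=(500, 13)),
}

# Fit one GMM per emotion class.
models = {
    emo: GaussianMixture(n_components=4, random_state=0).fit(X)
    for emo, X in train.items()
}

def classify(frames):
    # GaussianMixture.score returns the average per-frame
    # log-likelihood; pick the emotion whose model fits best.
    scores = {emo: gmm.score(frames) for emo, gmm in models.items()}
    return max(scores, key=scores.get)

# Frames drawn from the "angry"-like distribution.
test_utterance = rng.normal(loc=2.0, scale=1.0, size=(200, 13))
print(classify(test_utterance))
```

Band limitation and codec effects would enter upstream of this step: filtering or transcoding the audio changes the extracted feature vectors, which in turn shifts the model likelihoods and the resulting accuracy.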