Front-End Feature Compensation and Denoising for Noise Robust Speech Emotion Recognition

Front-end processing is one way to impart noise robustness to speech emotion recognition systems in mismatched scenarios. Here, we implement and compare different front-end robustness techniques for their efficacy in speech emotion recognition. First, we use a feature compensation technique based on the Vector Taylor Series (VTS) expansion of noisy Mel-Frequency Cepstral Coefficients (MFCCs). Next, we improve upon this feature compensation technique by using the VTS expansion with an auditory masking formulation. We also examine the applicability of 10-root compression in MFCC computation. Further, a Time Delay Neural Network based Denoising Autoencoder (TDNN-DAE) is implemented to estimate the clean MFCCs from the noisy MFCCs. These techniques have not yet been investigated for their suitability to the robust speech emotion recognition task. The performance of these front-end techniques is compared with that of the Non-Negative Matrix Factorization (NMF) based front-end. Relying on extensive experiments on two standard databases (EmoDB and IEMOCAP), contaminated with 5 types of noise, we show that these techniques provide a significant performance gain in the emotion recognition task. We also show that, along with front-end compensation, applying feature selection to non-MFCC high-level descriptors results in better performance.
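As an illustration of the 10-root compression mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes that root compression simply replaces the logarithm applied to Mel filterbank energies with a 10th root before the DCT; the function names, the 26-filter and 13-coefficient sizes, and the energy floor are hypothetical choices made for the example.

```python
import numpy as np

def dct_ii_matrix(n_ceps, n_filters):
    """Orthonormal DCT-II basis of shape (n_ceps, n_filters)."""
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_filters))
    basis *= np.sqrt(2.0 / n_filters)
    basis[0] /= np.sqrt(2.0)  # scale the DC row so the basis is orthonormal
    return basis

def cepstra_from_filterbank(energies, compression="log", root=10.0,
                            n_ceps=13, floor=1e-10):
    """Turn Mel filterbank energies (frames x filters) into cepstra.

    compression="log"  : conventional MFCC, log of the energies
    compression="root" : assumed root-compressed variant, energies ** (1 / root)
    """
    energies = np.maximum(np.asarray(energies, dtype=float), floor)
    if compression == "log":
        compressed = np.log(energies)
    elif compression == "root":
        compressed = energies ** (1.0 / root)
    else:
        raise ValueError("compression must be 'log' or 'root'")
    return compressed @ dct_ii_matrix(n_ceps, energies.shape[-1]).T

# Hypothetical input: 4 frames of 26 Mel filterbank energies.
rng = np.random.default_rng(0)
fbank = rng.uniform(1e-6, 1.0, size=(4, 26))
log_ceps = cepstra_from_filterbank(fbank, compression="log")
root_ceps = cepstra_from_filterbank(fbank, compression="root", root=10.0)
```

One commonly cited motivation for this choice is that the root function stays bounded as the energy approaches zero, whereas the logarithm grows without bound in magnitude, so low-energy, noise-dominated bands are emphasised less in the compressed spectrum.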
