Feature Pooling of Modulation Spectrum Features for Improved Speech Emotion Recognition in the Wild

Interest in affective computing is growing rapidly, in large part because of its role in emerging affective human-computer interfaces (HCI). To date, most research on automated emotion analysis has relied on data collected in controlled environments. With the rise of HCI applications on mobile devices, however, so-called “in-the-wild” settings pose a serious challenge to emotion recognition systems, particularly those based on voice. In such settings, environmental factors such as ambient noise and reverberation severely hamper system performance. In this paper, we quantify the detrimental effects that the environment has on emotion recognition and explore the benefits achievable with speech enhancement. Moreover, we propose a modulation spectral feature pooling scheme that is shown to outperform a state-of-the-art benchmark system for environment-robust prediction of the spontaneous arousal and valence emotional primitives. Experiments on an environment-corrupted version of the RECOLA dataset of spontaneous interactions show that the proposed feature pooling scheme, combined with speech enhancement, outperforms the benchmark across noise-only, reverberation-only, and noise-plus-reverberation conditions. Additional tests with the SEWA database show the benefits of the proposed method for in-the-wild applications.
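
As a rough illustration of what an "environment-corrupted" recording can look like, the sketch below mixes noise into a clean signal at a chosen signal-to-noise ratio and adds reverberation by convolving with a room impulse response. This is a generic recipe, not necessarily the procedure used in the paper; the function names `add_noise_at_snr` and `add_reverb`, and the choice to loop the noise to match the speech length, are assumptions made for this example.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Mix noise into clean speech at the requested SNR in dB (illustrative helper)."""
    # Repeat or truncate the noise so it matches the speech length (assumption).
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power equals 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise


def add_reverb(clean, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    return np.convolve(clean, rir)[: len(clean)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)   # stand-in for one second of speech at 16 kHz
    noise = rng.standard_normal(4000)    # stand-in for a shorter noise recording
    rir = np.array([1.0, 0.0, 0.5])      # toy impulse response: direct path plus one echo

    # Noise-plus-reverberation condition: reverberate first, then add noise.
    corrupted = add_noise_at_snr(add_reverb(clean, rir), noise, snr_db=10.0)
    print(corrupted.shape)
```

Applying only `add_noise_at_snr`, only `add_reverb`, or both in sequence gives simple counterparts of the noise-only, reverberation-only, and noise-plus-reverberation conditions mentioned above.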
