A closed-form solution to the graph total variation problem for continuous emotion profiling in noisy environment

Abstract: Time-continuous emotion estimation (e.g., arousal and valence) from spontaneous speech has recently drawn increasing commercial attention. However, real-life applications of emotion recognition technology must cope with challenging conditions, such as noise from recording devices and background environments. In this work, we introduce a novel personalized emotion prediction model and validate it in different noisy environments. The model relies on a three-level noise reduction algorithm: (i) data downsampling, (ii) feature synchronization, and (iii) a modified version of graph total variation. The approach has been validated on the widely used RECOLA database with different types of noise, including convolutive and additive noise at different SNRs. Feature synchronization improves the absolute values of the concordance correlation coefficient (CCC) by 0.271 on average for arousal and 0.137 for valence. The proposed denoising approach further improves the values by 0.101 for arousal and 0.086 for valence. Overall, the proposed model considerably improves the CCC values on raw data and on all types of noisy data, and outperforms the standard denoising methods.
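
Since all results above are reported as CCC values, a minimal sketch of the metric may help. It assumes the standard definition of the concordance correlation coefficient (twice the covariance divided by the sum of the two variances plus the squared difference of the means), with population statistics. The function name `concordance_cc` and the toy sequences are illustrative and do not come from the paper.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Assumes the standard definition with population (1/N) statistics:
    CCC = 2 * cov / (var_pred + var_gold + (mean_pred - mean_gold) ** 2)
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov / denom


# Hypothetical toy traces: a prediction that tracks the shape of the gold
# standard perfectly but is offset by a constant.
gold_trace = [1.0, 2.0, 3.0]
shifted_prediction = [2.0, 3.0, 4.0]
ccc_shifted = concordance_cc(shifted_prediction, gold_trace)
```

Unlike Pearson correlation, which would be 1 for the shifted prediction above, the CCC also penalizes differences in mean and scale, so `ccc_shifted` falls below 1.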
