Affective Burst Detection from Speech using Kernel-fusion Dilated Convolutional Neural Networks

As speech interfaces become richer and more widespread, speech emotion recognition promises increasingly attractive applications. In the continuous emotion recognition (CER) problem, tracking changes across affective states is an important and desired capability. Although CER studies widely use correlation metrics in their evaluations, these metrics do not always capture high-intensity changes in the affective domain. In this paper, we define a novel affective burst detection problem to accurately capture high-intensity changes of the affective attributes. We formulate this problem as a two-class classification that isolates affective burst regions over the affective state contour. The proposed classifier is a kernel-fusion dilated convolutional neural network (KFDCNN) architecture driven by speech spectral features, which segments the affective attribute contour into idle and burst sections. Experimental evaluations are performed on the RECOLA and CreativeIT datasets, where the proposed KFDCNN outperforms baseline feedforward neural networks on both.
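To make the kernel-fusion idea concrete, the sketch below shows one plausible reading of such an architecture in PyTorch: parallel dilated 1-D convolution branches with different kernel sizes run over a frame-level spectral feature sequence, their outputs are fused by channel concatenation, and a small head emits per-frame idle/burst logits. The kernel set, channel widths, dilation rate, and 88-dimensional input (an eGeMAPS-style feature count) are illustrative assumptions; the abstract does not specify the paper's exact configuration.

```python
import torch
import torch.nn as nn

class KernelFusionDilatedCNN(nn.Module):
    """Minimal sketch of a kernel-fusion dilated 1-D CNN for frame-level
    idle/burst classification over an affective attribute contour.
    Hyperparameters are illustrative, not the paper's configuration."""

    def __init__(self, n_features=88, n_classes=2,
                 kernel_sizes=(3, 5, 7), channels=64, dilation=2):
        super().__init__()
        # One branch per kernel size; "same" padding preserves the
        # temporal length so branch outputs can be fused frame by frame.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(n_features, channels, k,
                          padding=(k - 1) * dilation // 2,
                          dilation=dilation),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        # Fuse branches by concatenating along the channel axis, then
        # map each frame to idle/burst logits with 1x1 convolutions.
        self.head = nn.Sequential(
            nn.Conv1d(channels * len(kernel_sizes), channels, 1),
            nn.ReLU(),
            nn.Conv1d(channels, n_classes, 1),
        )

    def forward(self, x):          # x: (batch, n_features, time)
        fused = torch.cat([b(x) for b in self.branches], dim=1)
        return self.head(fused)    # (batch, n_classes, time)

# Toy usage: a batch of 4 utterances, 88 spectral features x 500 frames.
model = KernelFusionDilatedCNN()
logits = model(torch.randn(4, 88, 500))
print(logits.shape)  # torch.Size([4, 2, 500])
```

Fusing multiple kernel sizes lets the network see both short bursts and slower ramps in a single layer, while the dilation widens the temporal receptive field without extra parameters; a softmax over the class axis of the logits yields per-frame idle/burst posteriors.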
