Affect-salient event sequence modelling for continuous speech emotion recognition

Abstract Continuous speech emotion recognition is a challenging problem in affective computing: the annotations are delayed by the annotators' reaction time, and non-emotional segments introduce noise. To address these problems, we propose a new affect-salient event sequence modelling (ASESM) method based on connectionist temporal classification (CTC). The method represents a sentence's label as a chain of affect-salient event (ASE) states and non-affect-salient (Null) states rather than as a sequence of continuous emotional values. With this representation, a CTC-based convolutional neural network (CNN) is trained to automatically label the emotional segments of a sentence as ASEs and the non-emotional segments as Null, reducing the noise contributed by non-emotional segments. We further propose an event probability vector decoding (EPVD) algorithm that searches the CTC loss matrix for the optimal ASE sequence and marks the occurrence time of each event in that sequence. The arousal and valence ground-truth annotations of each ASE are then used as the continuous emotional values of every segment predicted as that ASE. Since the ground-truth annotations of each ASE already incorporate the annotators' reaction delays, taking events as the prediction target avoids additional delay compensation. We evaluate our method on the RECOLA and AVEC 2014 benchmark databases. The experimental results demonstrate that the proposed event-based method improves continuous emotion recognition performance, and the improvement is more pronounced when the selected ASEs have high annotation consistency.
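To make the event-based formulation concrete, the following is a minimal training sketch of how a sentence could be scored against an ASE/Null chain with a CTC loss. It is illustrative only: the event vocabulary size, the EventCNN encoder, and all hyper-parameters are assumptions rather than the paper's architecture, and the CTC blank symbol is used here to play the role of the Null state.

```python
import torch
import torch.nn as nn

# Hypothetical event vocabulary: index 0 is the CTC blank, which here
# doubles as the non-affect-salient "Null" state; 1..K are ASE classes.
NUM_EVENTS = 4                 # assumed number of ASE classes (illustrative)
NUM_CLASSES = NUM_EVENTS + 1   # +1 for the blank/Null symbol

class EventCNN(nn.Module):
    """Stand-in 1-D convolutional encoder over frame-level acoustic
    features. The paper builds a CTC-based CNN; its exact architecture
    is not reproduced here."""
    def __init__(self, feat_dim: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(128, NUM_CLASSES)

    def forward(self, x):            # x: (batch, time, feat_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).log_softmax(dim=-1)   # (batch, time, classes)

model = EventCNN()
ctc_loss = nn.CTCLoss(blank=0)       # blank index 0 = Null

feats = torch.randn(2, 300, 40)      # two utterances, 300 frames each
log_probs = model(feats)             # (batch, time, classes)

# Target ASE chains as event indices; CTC inserts the blank/Null itself.
targets = torch.tensor([1, 3, 2, 2, 4])      # concatenated label sequences
target_lengths = torch.tensor([3, 2])        # per-utterance label lengths
input_lengths = torch.full((2,), 300, dtype=torch.long)

# nn.CTCLoss expects log-probs shaped (time, batch, classes).
loss = ctc_loss(log_probs.transpose(0, 1), targets,
                input_lengths, target_lengths)
loss.backward()
```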
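The paper's EPVD algorithm searches the CTC loss matrix for the optimal ASE sequence; its exact search procedure is not reproduced here. As a stand-in, the sketch below performs a plain CTC best-path decode over frame posteriors, records the occurrence span of each surviving event, and maps each predicted ASE to hypothetical per-event arousal/valence annotations to reconstruct a continuous trace, mirroring the final step described in the abstract.

```python
import numpy as np

BLANK = 0  # Null / CTC blank index

# Hypothetical per-event (arousal, valence) ground-truth annotations;
# the paper derives these from annotated segments, the values below are
# placeholders for illustration.
EVENT_VALUES = {1: (0.31, 0.12), 2: (-0.20, 0.05),
                3: (0.44, 0.38), 4: (-0.35, -0.22)}

def best_path_decode(posteriors: np.ndarray):
    """Greedy best-path decode over a (time, classes) posterior matrix.

    Collapses repeated symbols and removes blanks, recording the frame
    span over which each surviving event was emitted. This is a plain
    CTC best-path decode, not the paper's EPVD search.
    """
    frame_labels = posteriors.argmax(axis=1)
    events, prev = [], BLANK
    for t, lab in enumerate(frame_labels):
        if lab != BLANK and lab != prev:
            events.append({"event": int(lab), "start": t, "end": t})
        elif lab != BLANK and lab == prev:
            events[-1]["end"] = t    # extend the current event's span
        prev = lab
    return events

def events_to_trace(events, num_frames: int):
    """Fill an (arousal, valence) trace: frames inside an event take the
    event's annotated values; Null frames stay at a neutral (0, 0)."""
    trace = np.zeros((num_frames, 2))
    for ev in events:
        trace[ev["start"]:ev["end"] + 1] = EVENT_VALUES[ev["event"]]
    return trace

post = np.random.dirichlet(np.ones(5), size=300)   # fake (T, C) posteriors
segments = best_path_decode(post)
arousal_valence = events_to_trace(segments, num_frames=300)
```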
