Coarse-to-Fine Speech Emotion Recognition Based on Multi-Task Learning

Speech emotion recognition is challenging because emotion is loosely defined and its feature representation is complex. Accurate feature representation is one of the key factors for successful speech emotion recognition. Studies have shown that 3D data composed of the static log-Mel spectrum together with its deltas and delta-deltas is very effective at filtering out irrelevant features. The challenge is also reflected in the need for fine-grained classification: typical applications of affective computing, such as psychological counseling and emotion regulation, require fine-grained emotion recognition. Motivated by these two observations, this paper proposes an end-to-end hierarchical multi-task learning framework that proceeds from coarse to fine to achieve fine-grained emotion recognition. Using the 3D data as input, the first stage is trained to predict the coarse emotion type, and its result is then used to assist second-stage training for the fine emotion type. In comparative experiments on the IEMOCAP corpus, the coarse-to-fine classification approach yields a significant performance improvement over the baseline models.
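The 3D input described above can be illustrated with a minimal sketch. The abstract does not say how the deltas are computed, so this assumes the common regression-based delta formula, a window width of 2, and repeated edge frames; the names `delta` and `stack_3d` are hypothetical and not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]  # frames x mel bands


def delta(feats: Matrix, width: int = 2) -> Matrix:
    """Regression-based time derivative (assumed formula, not from the paper).

    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)
    Frames beyond either end are replaced by the nearest edge frame.
    """
    n_frames = len(feats)
    n_bands = len(feats[0])
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out: Matrix = []
    for t in range(n_frames):
        row = []
        for b in range(n_bands):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = feats[min(t + n, n_frames - 1)][b]
                behind = feats[max(t - n, 0)][b]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_3d(log_mel: Matrix, width: int = 2) -> List[Matrix]:
    """Stack static, delta and delta-delta features as three channels."""
    d1 = delta(log_mel, width)   # first-order deltas
    d2 = delta(d1, width)        # delta-deltas: deltas of the deltas
    return [log_mel, d1, d2]     # shape: 3 x frames x mel bands
```

Given a log-Mel spectrogram stored as a list of frames, `stack_3d(log_mel)` returns three channels of identical shape, which is the layout a convolutional front end would typically consume.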
