Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
[1] Kenneth Ward Church, et al. Speech Emotion Recognition with Multi-Task Learning. Interspeech, 2021.
[2] Karen Livescu, et al. Layer-Wise Analysis of a Self-Supervised Speech Representation Model. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
[3] Tatsuya Kawahara, et al. End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model. Interspeech, 2020.
[4] Homayoon Beigi, et al. A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition. arXiv, 2020.
[5] Abdel-rahman Mohamed, et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. NeurIPS, 2020.
[6] Jilt Sebastian, et al. Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts. Interspeech, 2019.
[7] Saurabh Sahu, et al. Multi-Modal Learning for Speech Emotion Recognition: An Analysis and Comparison of ASR Outputs with Ground Truth Transcription. Interspeech, 2019.
[8] Tatsuya Kawahara, et al. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. Interspeech, 2019.
[9] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS, 2019.
[10] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
[11] Kyomin Jung, et al. Multimodal Speech Emotion Recognition Using Audio and Text. IEEE Spoken Language Technology Workshop (SLT), 2018.
[12] Johanna D. Moore, et al. Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features. IEEE Spoken Language Technology Workshop (SLT), 2016.
[13] Margaret Lech, et al. On the Correlation and Transferability of Features Between Automatic Speech Recognition and Speech Emotion Recognition. Interspeech, 2016.
[14] Johanna D. Moore, et al. Emotion recognition in spontaneous and acted dialogues. International Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
[15] Sanjeev Khudanpur, et al. Librispeech: An ASR corpus based on public domain audio books. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[16] Carlos Busso, et al. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008.
[17] Björn W. Schuller, et al. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.
[18] Rosalind W. Picard, et al. A computational model for the automatic recognition of affect in speech. 2004.
[19] Astrid Paeschke, et al. Prosodic Characteristics of Emotional Speech: Measurements of Fundamental Frequency Movements. 2000.