Speaker Normalization for Self-Supervised Speech Emotion Recognition

Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploiting those biases and finding shortcuts such as speaker characteristics; these shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns the speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
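To make the adversarial normalization idea concrete, below is a minimal PyTorch sketch of gradient-reversal training in the style of Ganin & Lempitsky's unsupervised domain adaptation by backpropagation: a shared encoder feeds an emotion head and, through a gradient reversal layer, a speaker head, so the encoder learns features that predict emotion while becoming uninformative about speaker identity. The module names, layer sizes, and loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows into the encoder; no gradient for lam.
        return -ctx.lam * grad_output, None

class SpeakerNormalizedSER(nn.Module):
    """Hypothetical model: a shared encoder over self-supervised speech features,
    an emotion classifier, and a speaker classifier behind gradient reversal."""
    def __init__(self, feat_dim=768, n_emotions=4, n_speakers=10, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.emotion_head = nn.Linear(256, n_emotions)
        self.speaker_head = nn.Linear(256, n_speakers)

    def forward(self, feats):  # feats: (batch, feat_dim), e.g. pooled wav2vec 2.0 features
        z = self.encoder(feats)
        emo_logits = self.emotion_head(z)
        spk_logits = self.speaker_head(GradReverse.apply(z, self.lam))
        return emo_logits, spk_logits

# Training step sketch: both heads minimize cross-entropy, but the reversed
# gradient makes the encoder *maximize* speaker loss, normalizing speaker cues away.
model = SpeakerNormalizedSER()
ce = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(8, 768)  # stand-in for self-supervised features
emo_y, spk_y = torch.randint(0, 4, (8,)), torch.randint(0, 10, (8,))
emo_logits, spk_logits = model(feats)
loss = ce(emo_logits, emo_y) + ce(spk_logits, spk_y)
opt.zero_grad(); loss.backward(); opt.step()
```

In practice the reversal weight lambda is often annealed from 0 toward 1 over training to stabilize the adversarial game; it is kept fixed here for simplicity.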
