Speaker Normalization for Self-Supervised Speech Emotion Recognition

Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploiting those biases and finding shortcuts such as speaker characteristics; these shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns the speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
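To make the adversarial normalization idea concrete, below is a minimal PyTorch sketch of gradient-reversal training in the style of Ganin & Lempitsky's unsupervised domain adaptation by backpropagation: a shared encoder feeds an emotion head and, through a gradient reversal layer, a speaker head, so the encoder learns features that predict emotion while becoming uninformative about speaker identity. The module names, layer sizes, and loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows into the encoder; no gradient for lam.
        return -ctx.lam * grad_output, None

class SpeakerNormalizedSER(nn.Module):
    """Hypothetical model: a shared encoder over self-supervised speech features,
    an emotion classifier, and a speaker classifier behind gradient reversal."""
    def __init__(self, feat_dim=768, n_emotions=4, n_speakers=10, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.emotion_head = nn.Linear(256, n_emotions)
        self.speaker_head = nn.Linear(256, n_speakers)

    def forward(self, feats):  # feats: (batch, feat_dim), e.g. pooled wav2vec 2.0 features
        z = self.encoder(feats)
        emo_logits = self.emotion_head(z)
        spk_logits = self.speaker_head(GradReverse.apply(z, self.lam))
        return emo_logits, spk_logits

# Training step sketch: both heads minimize cross-entropy, but the reversed
# gradient makes the encoder *maximize* speaker loss, normalizing speaker cues away.
model = SpeakerNormalizedSER()
ce = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(8, 768)  # stand-in for self-supervised features
emo_y, spk_y = torch.randint(0, 4, (8,)), torch.randint(0, 10, (8,))
emo_logits, spk_logits = model(feats)
loss = ce(emo_logits, emo_y) + ce(spk_logits, spk_y)
opt.zero_grad(); loss.backward(); opt.step()
```

In practice the reversal weight lambda is often annealed from 0 toward 1 over training to stabilize the adversarial game; it is kept fixed here for simplicity.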
