Multi-Modal Emotion Detection with Transfer Learning

Automated emotion detection in speech is a challenging task because of the complex interdependence between the words spoken and the manner in which they are delivered. The problem is compounded by the available datasets: their small size and incompatible labeling idiosyncrasies make it hard to build generalizable emotion detection systems. To address these two challenges, we present a multi-modal approach that first transfers learning from related tasks in speech and text to produce robust neural embeddings, and then uses these embeddings to train a pLDA classifier that can adapt to previously unseen emotions and domains. We begin by training a multilayer time-delay neural network (TDNN) on the task of speaker identification with the VoxCeleb corpora, and then fine-tune it on the task of emotion identification with the CREMA-D corpus. Using this network, we extract speech embeddings for CREMA-D from each of its layers, generate text embeddings for the accompanying transcripts with a fine-tuned BERT model, and train an LDA/pLDA classifier on the concatenated dense representations. We exhaustively evaluate the predictive power of every component: the TDNN alone, speech embeddings from each of its layers alone, text embeddings alone, and every combination thereof. Our best variant, trained on only VoxCeleb and CREMA-D and evaluated on IEMOCAP, achieves an equal error rate (EER) of 38.05%. Including a portion of IEMOCAP during training produces a 5-fold averaged EER of 25.72%; for comparison, 44.71% of the gold-label annotations include at least one annotator who disagrees.
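To make the pipeline concrete, the sketch below follows the same shape: a TDNN over frame-level speech features with statistics pooling at each layer, BERT [CLS] embeddings for the transcript, concatenation of the two, and an LDA projection in front of the scoring back end. The layer sizes, the bert-base-uncased checkpoint, the toy data, and the cosine similarity used in place of pLDA scoring are all illustrative assumptions rather than the paper's exact configuration; in practice a full pLDA back end would typically come from a toolkit such as Kaldi.

```python
# Minimal sketch of the embedding-fusion pipeline (illustrative shapes only).
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


class TDNN(nn.Module):
    """x-vector-style TDNN: dilated 1-D convolutions over frame features,
    with mean+std statistics pooling to a fixed-length vector at each layer."""

    def __init__(self, feat_dim=40, hidden=512, emb_dim=512):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3),
            nn.Conv1d(hidden, hidden, kernel_size=1),
        ])
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, feats):  # feats: (batch, feat_dim, frames)
        per_layer, x = [], feats
        for conv in self.layers:
            x = torch.relu(conv(x))
            # statistics pooling: mean and std over time, one vector per layer
            per_layer.append(torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1))
        return self.proj(per_layer[-1]), per_layer


def bert_embeddings(texts, model_name="bert-base-uncased"):
    """[CLS]-token embeddings for a list of transcripts."""
    tok = AutoTokenizer.from_pretrained(model_name)
    bert = AutoModel.from_pretrained(model_name)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**batch).last_hidden_state[:, 0]  # (batch, 768)


# Toy stand-ins for MFCC features, transcripts, and six emotion labels.
feats = torch.randn(8, 40, 300)
texts = ["i am so happy today"] * 8
labels = np.array([0, 1, 2, 3, 4, 5, 0, 1])

_, speech_layers = TDNN()(feats)
speech_emb = speech_layers[2]                    # embeddings from one layer
fused = torch.cat([speech_emb, bert_embeddings(texts)], dim=1).detach().numpy()

# LDA projects the fused vectors before the probabilistic back end; a cosine
# score against class centroids stands in for pLDA scoring here.
lda = LinearDiscriminantAnalysis(n_components=5).fit(fused, labels)
reduced = lda.transform(fused)
centroids = np.stack([reduced[labels == c].mean(axis=0) for c in range(6)])

def cosine_score(vec, emotion):
    c = centroids[emotion]
    return float(vec @ c / (np.linalg.norm(vec) * np.linalg.norm(c)))

print(cosine_score(reduced[0], emotion=0))
```

Swapping the cosine score for a trained pLDA model is the only structural change needed to match the back end described above; the per-layer statistics pooling is what makes "speech embeddings from each of its layers" available for the component-wise evaluation.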
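Since all results above are reported as equal error rates, a small helper clarifies what that metric measures: the operating point at which the false-acceptance and false-rejection rates coincide. This is a standard computation, not the paper's own scoring script, and the trial scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: the threshold-free point where FPR and FNR cross."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))  # index of the closest crossing
    return (fpr[i] + fnr[i]) / 2.0

# Invented trials: 1 = target emotion, 0 = non-target.
y = np.array([1, 1, 0, 0, 1, 0])
s = np.array([0.90, 0.70, 0.40, 0.35, 0.60, 0.80])
print(f"EER = {equal_error_rate(y, s):.2%}")
```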
