A Siamese Neural Network with Modified Distance Loss for Transfer Learning in Speech Emotion Recognition
Kexin Feng

Automatic emotion recognition plays a significant role in human-computer interaction and in the design of Internet of Things (IoT) technologies. Yet a common problem in emotion recognition systems is the scarcity of reliable labels. By modeling pairwise differences between samples of interest, a Siamese network can help mitigate this challenge, since it requires fewer samples than traditional deep learning methods. In this paper, we propose a distance loss that can be applied when fine-tuning a Siamese network; it optimizes the model based on the relevant distance between same-class and different-class pairs. Our system uses samples from the source data to pre-train the weights of the proposed Siamese neural network, which are then fine-tuned on the target data. We present an emotion recognition task that uses speech, since it is one of the most ubiquitous and frequently used bio-behavioral signals. Our target data comes from the RAVDESS dataset, while CREMA-D and eNTERFACE'05 are each used, in turn, as source data. Our results indicate that the proposed distance loss greatly benefits the fine-tuning of the Siamese network. In addition, the choice of source data has a greater effect on Siamese network performance than the number of frozen layers. These results suggest the great potential of applying Siamese networks and modeling pairwise differences in transfer learning for automatic emotion recognition.
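The abstract does not give the form of the proposed distance loss, so the following is only an illustrative sketch of the general idea of optimizing on distances between same-class and different-class pairs. It uses a generic margin-based pairwise loss, not the authors' modified distance loss. The function names, the `margin` parameter, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_margin_loss(pairs, margin=1.0):
    """Generic margin loss over embedding pairs (hypothetical, not the paper's loss).

    pairs: list of (embedding_a, embedding_b, same_class) tuples, where
    same_class is True when both samples share an emotion label.
    The loss is zero once the mean different-class distance exceeds the
    mean same-class distance by at least `margin`.
    """
    same = [euclidean(a, b) for a, b, s in pairs if s]
    diff = [euclidean(a, b) for a, b, s in pairs if not s]
    if not same or not diff:
        raise ValueError("need at least one same-class and one different-class pair")
    mean_same = sum(same) / len(same)
    mean_diff = sum(diff) / len(diff)
    return max(0.0, mean_same - mean_diff + margin)


# Well-separated embeddings: same-class pair is close, different-class pair is far.
separated = [([0.0, 0.0], [0.0, 0.0], True), ([0.0, 0.0], [3.0, 4.0], False)]
# Poorly separated embeddings: the same-class pair is the distant one.
confused = [([0.0, 0.0], [3.0, 4.0], True), ([0.0, 0.0], [0.0, 0.0], False)]

loss_separated = pairwise_margin_loss(separated)
loss_confused = pairwise_margin_loss(confused)
```

In a transfer-learning setting like the one described, a loss of this kind would be computed on embeddings produced by the two weight-sharing branches of the network during fine-tuning on target-data pairs; how the paper actually combines or weights the distances is not stated in the abstract.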
