Exploring Siamese Neural Network Architectures for Preserving Speaker Identity in Speech Emotion Classification

Voice-enabled communication is increasingly being used in real-world applications, such as those involving conversational bots or "chatbots". Chatbots can spark and sustain user engagement by effectively recognizing users' emotions and acting upon them. However, the majority of emotion recognition systems rely on rich spectrotemporal acoustic features. Beyond the emotion-related information, such systems tend to preserve information relevant to the identity of the speaker, thereby raising major privacy concerns for users. This paper introduces two hybrid architectures for privacy-preserving emotion recognition from speech. These architectures rely on a Siamese neural network whose input and intermediate layers are transformed using various privacy-preserving operations in order to retain emotion-dependent content and suppress information related to the identity of the speaker. The proposed approach is evaluated through emotion classification and speaker identification performance metrics. Results indicate that the proposed framework achieves up to 67.4% accuracy in classifying among happy, sad, frustrated, angry, neutral, and other emotions on the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. At the same time, the proposed approach reduces speaker identification accuracy to 50%, compared to 81% achieved by a feedforward neural network trained solely on the speaker identification task using the same input features.
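The paper itself does not include code, but the core idea it describes, a Siamese network (twin encoders with shared weights) whose inputs are passed through a privacy-preserving transformation before embedding, can be sketched roughly as below. This is a minimal illustrative assumption, not the authors' implementation: the random-projection transform, the dimensions, and all function names here are hypothetical choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: an 88-dim acoustic feature vector (e.g., an
# openSMILE-style functional set), a projected space, and an embedding.
D_IN, D_PROJ, D_EMB = 88, 40, 16

# Illustrative privacy transform: a fixed random projection applied to the
# input features before the encoder sees them, so the encoder never
# observes the raw speaker-revealing feature space.
P = rng.normal(size=(D_IN, D_PROJ)) / np.sqrt(D_PROJ)

# Shared encoder weights: both branches of the Siamese network apply the
# SAME parameters, which is what makes the architecture "Siamese".
W = rng.normal(scale=0.1, size=(D_PROJ, D_EMB))

def encode(x):
    """Map a raw feature vector to an embedding via the shared encoder."""
    return np.tanh((x @ P) @ W)

def distance(x_a, x_b):
    """Euclidean distance between embeddings of two utterances.

    Training would pull same-emotion pairs together and push
    different-emotion pairs apart (e.g., via a contrastive loss).
    """
    return float(np.linalg.norm(encode(x_a) - encode(x_b)))

# Two synthetic utterance-level feature vectors standing in for real data.
x1, x2 = rng.normal(size=D_IN), rng.normal(size=D_IN)
d = distance(x1, x2)
```

In a trained system, emotion labels would supervise the pairwise distances while the fixed input transform limits how much speaker identity survives into the embedding; evaluating both an emotion classifier and a speaker-identification probe on the embeddings, as the paper does, measures that trade-off.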
