Generating fMRI-Enriched Acoustic Vectors using a Cross-Modality Adversarial Network for Emotion Recognition

Automatic emotion recognition has long been developed by modeling human expressive behavior. At the same time, neuroscientific evidence has shown that neural responses, i.e., blood oxygen level-dependent (BOLD) signals measured with functional magnetic resonance imaging (fMRI), also vary with the type of emotion perceived. While past research has indicated that fusing acoustic features with fMRI improves overall speech emotion recognition performance, obtaining fMRI data is not feasible in real-world applications. In this work, we propose a cross-modality adversarial network that jointly models the bi-directional generative relationship between the acoustic features of speech samples and the fMRI signals of human perceptual responses by leveraging a parallel dataset. We encode the acoustic descriptors of a speech sample with the learned cross-modality adversarial network to generate fMRI-enriched acoustic vectors, which are then fed to an emotion classifier. The generated fMRI-enriched acoustic vectors are evaluated not only on the parallel dataset but also on an additional dataset without fMRI scans. Our proposed framework significantly outperforms using acoustic features alone in a four-class emotion recognition task on both datasets, and the cyclic loss used in learning the bi-directional mapping is shown to be crucial for achieving the improved recognition rates.
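A minimal sketch of the bi-directional mapping with a cycle-consistency term, as described above, written in PyTorch. This is not the authors' implementation: the generator/discriminator architectures, feature dimensions, and loss weight are illustrative assumptions.

```python
# Sketch of a cross-modality adversarial objective with a cyclic loss.
# All dimensions, layer sizes, and weights are hypothetical.
import torch
import torch.nn as nn

ACOUSTIC_DIM, FMRI_DIM = 88, 116   # assumed feature dimensions

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

G_af = mlp(ACOUSTIC_DIM, FMRI_DIM)   # acoustic -> fMRI generator
G_fa = mlp(FMRI_DIM, ACOUSTIC_DIM)   # fMRI -> acoustic generator
D_f  = mlp(FMRI_DIM, 1)              # discriminator on the fMRI domain
D_a  = mlp(ACOUSTIC_DIM, 1)          # discriminator on the acoustic domain

bce = nn.BCEWithLogitsLoss()
l1  = nn.L1Loss()

def generator_loss(a_real, f_real, lambda_cyc=10.0):
    """Adversarial + cycle-consistency loss for one paired mini-batch."""
    f_fake = G_af(a_real)            # generated fMRI-like vector
    a_fake = G_fa(f_real)            # generated acoustic-like vector

    # Adversarial terms: generators try to fool both discriminators.
    adv = bce(D_f(f_fake), torch.ones_like(D_f(f_fake))) + \
          bce(D_a(a_fake), torch.ones_like(D_a(a_fake)))

    # Cyclic loss: mapping to the other modality and back should
    # reconstruct the original input in each direction.
    cyc = l1(G_fa(f_fake), a_real) + l1(G_af(a_fake), f_real)
    return adv + lambda_cyc * cyc

# After training, G_af(a) can enrich an acoustic vector, e.g. by
# concatenating [a, G_af(a)] as the fMRI-enriched input to the classifier.
a = torch.randn(32, ACOUSTIC_DIM)
f = torch.randn(32, FMRI_DIM)
loss = generator_loss(a, f)
```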
