Stimulus Speech Decoding from Human Cortex with Generative Adversarial Network Transfer Learning

Decoding auditory stimuli from neural activity can enable neuroprosthetics and direct communication with the brain. Several recent studies have demonstrated successful speech decoding from intracranial recordings using deep learning models. However, the scarcity of training data leads to low-quality speech reconstruction, which prevents a complete brain-computer interface (BCI) application. In this work, we propose a transfer learning approach with a pre-trained generative adversarial network (GAN) that disentangles the representation and generation layers used for decoding. We first pre-train a generator to produce spectrograms from a representation space using a large corpus of natural speech. With a small amount of paired data containing the stimulus speech and the corresponding ECoG signals, we then transfer the generator into a larger network by attaching an encoder in front of it that maps the neural signal to the representation space. To further improve the network's generalization, we introduce a Gaussian prior distribution regularizer on the latent representation during the transfer phase. With at most 150 training samples per tested subject, we achieve state-of-the-art decoding performance. By visualizing the attention mask embedded in the encoder, we observe brain dynamics consistent with findings from previous studies of the superior temporal gyrus (STG), pre-central gyrus (motor cortex), and inferior frontal gyrus (IFG). Our results demonstrate high reconstruction accuracy with deep neural networks, together with the potential to elucidate interactions across brain regions during a cognitive task.
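The abstract does not specify the exact form of the Gaussian prior regularizer. A common choice for pulling encoder outputs toward a standard normal prior is a per-dimension KL divergence between the batch's empirical Gaussian statistics and N(0, I); the sketch below illustrates that choice as an assumption, not as the paper's actual loss term (the function name `gaussian_prior_penalty` is hypothetical).

```python
import math

def gaussian_prior_penalty(latents):
    """Assumed form of a Gaussian prior regularizer on encoder latents.

    latents: a batch of latent vectors (list of equal-length lists of floats).
    Computes KL( N(mu_d, var_d) || N(0, 1) ) per latent dimension d from the
    batch's empirical mean/variance, averaged over dimensions. Adding this
    term to the decoding loss discourages the encoder from drifting away
    from the region the pre-trained generator was trained on.
    """
    n = len(latents)
    dim = len(latents[0])
    kl = 0.0
    for d in range(dim):
        vals = [z[d] for z in latents]
        mu = sum(vals) / n
        var = sum((v - mu) ** 2 for v in vals) / n
        # Closed-form KL between two univariate Gaussians, with a small
        # epsilon inside the log for numerical stability when var -> 0.
        kl += 0.5 * (mu ** 2 + var - math.log(var + 1e-8) - 1.0)
    return kl / dim
```

A batch whose latents already match N(0, 1) incurs (near) zero penalty, while a shifted batch is penalized, which is the behavior a transfer-phase regularizer of this kind would need.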
