Machine translation of cortical activity to text with an encoder-decoder framework

A decade after the first successful attempt to decode speech directly from human brain signals, accuracy and speed remain far below those of natural speech or typing. Here we show how to achieve high accuracy from the electrocorticogram at natural-speech rates, even with few data (on the order of half an hour of speech). Taking a cue from recent advances in machine translation and automatic speech recognition, we train a recurrent neural network to map neural signals directly to word sequences (sentences). In particular, the network first encodes a sentence-length sequence of neural activity into an abstract representation, and then decodes this representation, word by word, into an English sentence. For each participant, training data consist of several spoken repeats of a set of some 30-50 sentences, along with the corresponding neural signals at each of about 250 electrodes distributed over peri-Sylvian speech cortices. Average word error rates across a validation (held-out) sentence set are as low as 7% for some participants, as compared to the previous state of the art of greater than 60%. Finally, we show how to use transfer learning to overcome limitations on data availability: training certain components of the network on multiple participants’ data, while keeping other components (e.g., the first hidden layer) “proprietary,” can improve decoding performance, even though electrode coverage differs greatly across participants.
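
The word error rate quoted above can be made concrete with a short sketch. It assumes the conventional definition (word-level edit distance between decoded and reference sentence, divided by the number of reference words) and a simple per-sentence average; neither the exact scoring procedure nor the averaging scheme is specified in this section, and the function names and example sentences are hypothetical.

```python
from typing import Iterable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit cost for each
    substitution, insertion, and deletion.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference sentence must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_word_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average the per-sentence rates over (reference, hypothesis) pairs.

    Assumption: an unweighted mean over sentences, not a pooled word count.
    """
    rates = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    if not rates:
        raise ValueError("at least one sentence pair is required")
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical held-out sentences and decoder outputs.
    held_out = [
        ("the cat sat on the mat", "the cat sat on a mat"),
        ("a b c d", "a c d"),
    ]
    print(mean_word_error_rate(held_out))
```

Under this definition a rate can exceed 100% when the decoded sentence contains many inserted words, so rates near 60% mean that more than half of the reference words required correction.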
