Toward a realistic model of speech processing in the brain with self-supervised learning

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and/or (4) implausibly large memory (e.g., thousands of contextual words). These elements highlight the need to identify algorithms that, under these limitations, would suffice to account for both behavioral and brain responses. Focusing on speech processing, we hypothesize here that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin individuals recorded with functional Magnetic Resonance Imaging (fMRI) while they listened to approximately one hour of audio books. Our results are four-fold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech -- a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to that of the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific, and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These elements, resulting from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path toward identifying the laws of language acquisition that shape the human brain.
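One common way to operationalize such a model-to-brain comparison is an encoding-model analysis: activations extracted from the speech model for the audio stimulus are linearly mapped onto each participant's fMRI responses, and the cross-validated correlation between predicted and measured signals serves as a "brain score". The sketch below illustrates that general pipeline under stated assumptions; it is not the authors' code. The checkpoint name, the helper functions layer_activations and brain_score, and the arrays X and Y are illustrative placeholders, and the alignment of model activations to the fMRI sampling rate (downsampling, haemodynamic convolution) is deliberately omitted.

import numpy as np
import torch
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# A pretrained self-supervised checkpoint (assumed; not necessarily the one used in the paper).
MODEL_NAME = "facebook/wav2vec2-base"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()


def layer_activations(waveform, sampling_rate=16_000, layer=6):
    """Hidden states of one transformer layer for a raw 1-D audio waveform."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Shape (frames, channels); frames are ~20 ms steps and still need to be
    # aligned to the fMRI sampling rate before fitting the encoding model.
    return outputs.hidden_states[layer].squeeze(0).numpy()


def brain_score(X, Y, n_splits=5):
    """Cross-validated voxel-wise correlation between ridge predictions and fMRI data.

    X: (n_samples, n_features) model activations aligned to the fMRI volumes.
    Y: (n_samples, n_voxels)   measured fMRI responses to the same audio.
    """
    fold_scores = []
    for train, test in KFold(n_splits=n_splits).split(X):
        ridge = RidgeCV(alphas=np.logspace(-1, 6, 8)).fit(X[train], Y[train])
        pred = ridge.predict(X[test])
        # Pearson correlation computed independently for each voxel.
        r = [np.corrcoef(pred[:, v], Y[test, v])[0, 1] for v in range(Y.shape[1])]
        fold_scores.append(r)
    return np.asarray(fold_scores).mean(axis=0)  # one "brain score" per voxel

Scores obtained this way can then be compared across model layers, training regimes (random, non-speech, non-native, native speech), and cortical regions, which is the kind of analysis the hierarchy and specialization claims above refer to.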
