Inductive biases, pretraining and fine-tuning jointly account for brain responses to speech

Our ability to comprehend speech remains, to date, unrivaled by deep learning models. This feat could result from the brain’s ability to fine-tune generic sound representations for speech-specific processes. To test this hypothesis, we compare i) five types of deep neural networks to ii) human brain responses elicited by spoken sentences and recorded in 102 Dutch subjects using functional Magnetic Resonance Imaging (fMRI). Each network was either trained on an acoustic scene classification task, trained on a speech-to-text task (in Bengali, English, or Dutch), or left untrained. The similarity between each model and the brain is assessed by correlating their respective activations after an optimal linear projection. The differences in brain similarity across networks revealed three main results. First, speech representations in the brain can be accounted for by random deep networks. Second, learning to classify acoustic scenes leads deep nets to increase their brain similarity. Third, learning to process phonetically related speech inputs (i.e., Dutch vs. English) leads deep nets to reach higher levels of brain similarity than learning to process phonetically distant speech inputs (i.e., Dutch vs. Bengali). Together, these results suggest that the human brain fine-tunes its heavily trained auditory hierarchy to learn to process speech.
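
In practice, the brain-similarity measure described above amounts to fitting a regularized linear encoding model from network activations to fMRI responses, then correlating its held-out predictions with the measured signal. The sketch below illustrates this procedure; it assumes ridge regression (with cross-validated regularization) as the "optimal linear projection", and all array shapes, variable names, and data are hypothetical placeholders rather than the paper's actual pipeline.

```python
# Minimal sketch of a "brain score": map deep-net activations to voxel
# responses with a regularized linear projection, then correlate held-out
# predictions with the true responses. All shapes and data are synthetic.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical inputs: network activations (time points x units) and
# fMRI BOLD responses (time points x voxels) for the same spoken sentences.
X = rng.standard_normal((1000, 512))   # deep-net activations
Y = rng.standard_normal((1000, 200))   # voxel responses

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# "Optimal linear projection": ridge regression from activations to voxels,
# with the regularization strength chosen by cross-validation.
model = RidgeCV(alphas=np.logspace(-3, 3, 7))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Brain similarity: Pearson correlation between predicted and measured
# responses, computed independently for each voxel, then averaged.
r = [np.corrcoef(Y_test[:, v], Y_pred[:, v])[0, 1] for v in range(Y.shape[1])]
print(f"mean brain score across voxels: {np.mean(r):.3f}")
```

Comparing this score across networks (untrained, trained on acoustic scenes, or trained on speech-to-text in different languages) is what allows the inductive-bias, pretraining, and fine-tuning contributions to be teased apart.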
