Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions

Models that predict brain responses to stimuli provide one measure of understanding of a sensory system, and have many potential applications in science and engineering. Stimulus-computable sensory models are thus a longstanding goal of neuroscience. Deep neural networks have emerged as the leading such predictive models of the visual system, but are less explored in audition. Prior work provided examples of audio-trained neural networks that produced good predictions of auditory cortical fMRI responses and exhibited correspondence between model stages and brain regions, but left it unclear whether these results generalize to other neural network models, and thus how to further improve models in this domain. We evaluated brain-model correspondence for publicly available audio neural network models along with in-house models trained on four different tasks. Most tested models out-predicted previous filter-bank models of auditory cortex, and exhibited systematic model-brain correspondence: middle stages best predicted primary auditory cortex while deep stages best predicted non-primary cortex. However, some state-of-the-art models produced substantially worse brain predictions. The training task influenced the prediction quality for specific cortical tuning properties, with best overall predictions resulting from models trained on multiple tasks. The results suggest the importance of task optimization for explaining brain representations and generally support the promise of deep neural networks as models of audition.
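
The evaluation summarized above, predicting cortical responses from each model stage and asking which stage best predicts each region, can be illustrated with a minimal sketch. The abstract does not specify the regression method or the scoring metric, so everything here is an assumption made for illustration: closed-form ridge regression from stage activations to voxel responses, Pearson correlation on held-out sounds as the score, and hypothetical function names (`ridge_fit`, `stage_scores`, `best_stage_per_voxel`, `synthetic_demo`). The data are synthetic, not fMRI measurements.

```python
import numpy as np


def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights mapping features X (n, d) to responses Y (n, v)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def column_correlation(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom


def stage_scores(stage_features, responses, train_idx, test_idx, alpha=1.0):
    """Return (n_stages, n_voxels) held-out prediction correlations.

    stage_features: list of arrays, one per model stage, each (n_sounds, n_units).
    responses: array (n_sounds, n_voxels) of measured responses.
    """
    scores = []
    for X in stage_features:
        # Center features and responses using training statistics only.
        x_mean = X[train_idx].mean(axis=0)
        y_mean = responses[train_idx].mean(axis=0)
        W = ridge_fit(X[train_idx] - x_mean, responses[train_idx] - y_mean, alpha)
        pred = (X[test_idx] - x_mean) @ W + y_mean
        scores.append(column_correlation(pred, responses[test_idx]))
    return np.stack(scores)


def best_stage_per_voxel(scores):
    """Index of the model stage with the highest held-out correlation per voxel."""
    return scores.argmax(axis=0)


def synthetic_demo(seed=0):
    """Toy check: voxels 0-4 are driven by stage 0, voxels 5-9 by stage 1."""
    rng = np.random.default_rng(seed)
    n_sounds, n_units = 200, 10
    stages = [rng.standard_normal((n_sounds, n_units)) for _ in range(2)]
    responses = np.hstack([
        stages[0] @ rng.standard_normal((n_units, 5)),
        stages[1] @ rng.standard_normal((n_units, 5)),
    ]) + 0.1 * rng.standard_normal((n_sounds, 10))
    train_idx = np.arange(150)
    test_idx = np.arange(150, 200)
    scores = stage_scores(stages, responses, train_idx, test_idx)
    return scores, best_stage_per_voxel(scores)
```

Calling `synthetic_demo()` returns the per-stage score matrix and the winning stage for each synthetic voxel. In a real analysis the regularization strength would typically be chosen by cross-validation and the scores corrected for measurement noise; both steps are omitted here to keep the sketch short.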
