Simple Acoustic Features Can Explain Phoneme-Based Predictions of Cortical Responses to Speech

Summary

When we listen to speech, we have to make sense of a waveform of sound pressure. Hierarchical models of speech perception assume that, to extract semantic meaning, the signal is transformed into unknown intermediate neuronal representations. Traditionally, studies of such intermediate representations have been guided by linguistically defined concepts such as phonemes. Here, we argue that to arrive at an unbiased understanding of the neuronal responses to speech, we should focus instead on representations obtained directly from the stimulus. We illustrate our view with a data-driven, information-theoretic analysis of a dataset of 24 young, healthy humans who listened to a 1-hour narrative while their magnetoencephalogram (MEG) was recorded. We find that two recent results can be explained by an encoding model based entirely on acoustic features: the improved performance of an encoding model that combines annotated linguistic and acoustic features, and the decoding of phoneme subgroups from phoneme-locked responses. These acoustic features capitalize on acoustic edges and outperform Gabor-filtered spectrograms, which can explicitly describe the spectrotemporal characteristics of individual phonemes. Having replicated our results in publicly available electroencephalography (EEG) data, we conclude that models of brain responses based on linguistic features can serve as excellent benchmarks. However, we believe that to further our understanding of human cortical responses to speech, we should also explore low-level, parsimonious explanations for apparent high-level phenomena.
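To make the notion of an encoding model based entirely on acoustic features more concrete, the following is a minimal sketch on synthetic data. It is not the analysis pipeline of the study. It assumes a generic time-lagged ridge regression, and it assumes that an "acoustic edge" feature can be approximated by the half-wave rectified first difference of the envelope. The function names (`acoustic_edges`, `lagged_design`, `fit_ridge`), the number of lags, and the regularization strength are all hypothetical choices made for illustration.

```python
import numpy as np

def acoustic_edges(envelope):
    """Half-wave rectified first difference of the envelope.

    Assumption: used here as a simple stand-in for an acoustic edge feature.
    """
    d = np.diff(envelope, prepend=envelope[0])
    return np.maximum(d, 0.0)

def lagged_design(features, n_lags):
    """Stack time-lagged copies of the feature columns, shape (T, F * n_lags)."""
    T, F = features.shape
    X = np.zeros((T, F * n_lags))
    for lag in range(n_lags):
        # Column block `lag` holds the features delayed by `lag` samples.
        X[lag:, lag * F:(lag + 1) * F] = features[:T - lag]
    return X

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Synthetic stand-in for a speech envelope: rectified, smoothed noise.
rng = np.random.default_rng(0)
T, n_lags = 4000, 20
envelope = np.abs(np.convolve(rng.standard_normal(T), np.ones(25) / 25, mode="same"))

# Two acoustic features: the envelope and its edges, z-scored per column.
feats = np.column_stack([envelope, acoustic_edges(envelope)])
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
X = lagged_design(feats, n_lags)

# Simulated sensor signal: a random lagged filter of the features plus noise.
true_w = rng.standard_normal(X.shape[1])
signal = X @ true_w
response = signal + 0.1 * signal.std() * rng.standard_normal(T)

# Fit on the first half, evaluate the prediction on the held-out second half.
half = T // 2
w = fit_ridge(X[:half], response[:half], alpha=1.0)
r = np.corrcoef(X[half:] @ w, response[half:])[0, 1]
print(f"held-out prediction correlation: {r:.3f}")
```

In this toy setting the held-out correlation between predicted and simulated responses plays the role of the model performance measure. Comparing such scores across feature sets, for example purely acoustic features versus acoustic plus annotated linguistic features, is the general logic behind the comparisons described in the summary.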
