Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem

There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available and provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines. It uses novel multivariate techniques to compare the incremental ‘machine states’ generated as the ASR analysis progresses over time with the incremental ‘brain states’ generated as human listeners hear the same inputs, measured using combined electro- and magneto-encephalography (EMEG). This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech-to-lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain.
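The abstract does not spell out the multivariate technique used to compare machine states with brain states. As an illustration only, the sketch below assumes a representational-similarity style comparison: each system's responses to the same set of stimuli are summarised as a matrix of pairwise dissimilarities, and the two matrices are then rank-correlated. All function names, array shapes, and the synthetic data are hypothetical; nothing here is taken from the authors' pipeline or from real EMEG recordings.

```python
import numpy as np


def rdm(patterns):
    """Dissimilarity matrix (1 - Pearson r) between rows of a
    conditions-by-features array."""
    return 1.0 - np.corrcoef(patterns)


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries of a square matrix, as a vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def ordinal_rank(values):
    """Ordinal ranks. Ties are not averaged, which is adequate for the
    continuous-valued synthetic data used here."""
    return np.argsort(np.argsort(values)).astype(float)


def rdm_similarity(model_rdm, brain_rdm):
    """Spearman-style correlation between two dissimilarity matrices."""
    a = ordinal_rank(upper_triangle(model_rdm))
    b = ordinal_rank(upper_triangle(brain_rdm))
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-ins (assumptions, not real data):
# 12 stimulus conditions, 40 hypothetical machine-state features.
rng = np.random.default_rng(0)
n_conditions = 12
machine_states = rng.normal(size=(n_conditions, 40))

# A "brain" pattern that shares the machine geometry (noisy duplicated copy)
# and one that is unrelated to it.
brain_related = np.hstack([machine_states, machine_states])
brain_related = brain_related + 0.2 * rng.normal(size=brain_related.shape)
brain_unrelated = rng.normal(size=(n_conditions, 80))

score_related = rdm_similarity(rdm(machine_states), rdm(brain_related))
score_unrelated = rdm_similarity(rdm(machine_states), rdm(brain_unrelated))
print(f"related: {score_related:.2f}, unrelated: {score_unrelated:.2f}")
```

In a time-resolved setting, one would presumably repeat such a comparison at each time step and cortical location, then assess significance with an appropriate statistical test; the sketch shows a single comparison only.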
