EARSHOT: A Minimal Neural Network Model of Incremental Human Speech Recognition

Despite the lack-of-invariance problem (the many-to-many mapping between acoustics and percepts), human listeners experience phonetic constancy and typically perceive what a speaker intends. Most models of human speech recognition (HSR) have sidestepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, carefully engineered deep learning networks allow robust, real-world automatic speech recognition (ASR). However, the complexity of deep learning architectures and training regimens makes it difficult to use them to gain direct insight into the mechanisms that may support HSR. In this brief article, we report preliminary results from a two-layer network that borrows one element from ASR, long short-term memory (LSTM) nodes, which provide dynamic memory over a range of temporal spans. This allows the model to learn to map real speech from multiple talkers to semantic targets with high accuracy, exhibiting a human-like time course of lexical access and phonological competition. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite receiving no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of HSR.
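
To make the architecture concrete, the following is a minimal sketch of an EARSHOT-style network, assuming PyTorch. The class name EarshotLikeNet, the layer sizes (256 spectrogram channels, 512 LSTM units, a 300-dimensional sparse semantic vector with 30 active units), and the single training step are illustrative assumptions chosen for exposition, not the paper's exact implementation.

    # Sketch (assumptions noted above): an EARSHOT-style LSTM that maps
    # spectrogram frames to a semantic pattern at every time step.
    import torch
    import torch.nn as nn

    class EarshotLikeNet(nn.Module):
        """Spectrogram frames in, a graded semantic pattern out per frame."""
        def __init__(self, n_channels=256, n_hidden=512, n_semantic=300):
            super().__init__()
            self.lstm = nn.LSTM(n_channels, n_hidden, batch_first=True)
            self.out = nn.Linear(n_hidden, n_semantic)

        def forward(self, spectrogram):
            # spectrogram: (batch, time, n_channels)
            hidden, _ = self.lstm(spectrogram)       # (batch, time, n_hidden)
            return torch.sigmoid(self.out(hidden))   # (batch, time, n_semantic)

    def sparse_semantic_target(n_semantic=300, n_active=30, seed=0):
        """A sparse random binary vector standing in for a word's meaning."""
        g = torch.Generator().manual_seed(seed)
        idx = torch.randperm(n_semantic, generator=g)[:n_active]
        target = torch.zeros(n_semantic)
        target[idx] = 1.0
        return target

    # One illustrative training step: push the output toward the word's
    # semantic vector at every frame of the token (binary cross-entropy).
    model = EarshotLikeNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCELoss()

    speech = torch.randn(1, 100, 256)  # stand-in word token: 100 frames
    target = sparse_semantic_target(seed=42).expand(1, 100, -1)
    loss = loss_fn(model(speech), target)
    loss.backward()
    optimizer.step()

The design choice mirrored here is that the semantic target is presented at every frame, so the output can be read out continuously as the word unfolds; it is this frame-by-frame readout that supports comparison with the human time course of lexical access and phonological competition.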
