EARSHOT: A minimal network model of human speech recognition that operates on real speech

Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), we experience phonetic constancy and typically perceive what a speaker intends. Models of human speech recognition have sidestepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, automatic speech recognition systems powered by deep learning networks achieve robust, real-world speech recognition. However, the complexity of deep learning architectures and training regimens makes it difficult to use them to gain direct insight into the mechanisms that may support human speech recognition. We developed a simple network that borrows one element from automatic speech recognition: long short-term memory (LSTM) nodes, which provide dynamic memory over short and long time spans. This allows the network to learn to map real speech from multiple talkers to semantic targets with high accuracy. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite receiving no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of human speech recognition.
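
To make the architecture concrete, here is a minimal sketch in PyTorch (not the authors' code; the layer sizes, the spectral input format, and the sparse semantic targets are illustrative assumptions): a single LSTM layer reads a spectral representation of speech frame by frame, and a sigmoid output layer is trained to reproduce the target word's semantic vector at every time step, so no phonetic or phonemic labels enter the training signal.

```python
import torch
import torch.nn as nn

class EarshotSketch(nn.Module):
    """Sketch of an EARSHOT-style network under assumed hyperparameters:
    one LSTM layer maps frames of a spectral representation of real speech
    directly to a sparse semantic target vector at every time step."""
    def __init__(self, n_spectral=256, n_hidden=512, n_semantic=300):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_spectral, hidden_size=n_hidden,
                            batch_first=True)
        self.out = nn.Linear(n_hidden, n_semantic)

    def forward(self, spectrogram):
        # spectrogram: (batch, time, n_spectral)
        hidden, _ = self.lstm(spectrogram)       # (batch, time, n_hidden)
        # Sigmoid output: each unit signals the presence of one semantic feature
        return torch.sigmoid(self.out(hidden))   # (batch, time, n_semantic)

model = EarshotSketch()
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch for illustration: 8 word tokens, 100 spectral frames each.
speech = torch.rand(8, 100, 256)
# Fake sparse semantic vectors, one per word, repeated at every frame so
# the network is pressured to activate a word's semantics as early as the
# unfolding input allows (incremental recognition).
targets = (torch.rand(8, 300) > 0.9).float()
targets = targets.unsqueeze(1).expand(-1, 100, -1)

optimizer.zero_grad()
loss = loss_fn(model(speech), targets)
loss.backward()
optimizer.step()
```

Under this setup, any phonetically organized structure found in the LSTM hidden states is emergent: it is a byproduct of mapping acoustics to semantics, not something the loss function asks for.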
