A time-invariant connectionist model of spoken word recognition

Thomas Hannagan (thom.hannagan@gmail.com)
CNRS & Aix-Marseille University, 3, place Victor Hugo, 13331 Marseille, France

James S. Magnuson (james.magnuson@uconn.edu)
Department of Psychology, University of Connecticut, 406 Babbidge Road, Unit 1020, Storrs, CT 06269-1020 USA
and Haskins Laboratories, 300 George St., New Haven, CT 06511 USA

Jonathan Grainger (i.jonathan.grainger@gmail.com)
CNRS & Aix-Marseille University, 3, place Victor Hugo, 13331 Marseille, France

Abstract

One of the largest remaining unsolved mysteries in cognitive science is how the rapid input of spoken language is mapped onto phonological and lexical representations over time. Attempts at psychologically tractable computational models of spoken word recognition tend either to ignore time or to transform the temporal input into a spatial representation. The latter is the approach taken in TRACE (McClelland & Elman, 1986), the model of spoken word recognition with the broadest and deepest coverage of phenomena in speech perception, spoken word recognition, and lexical parsing of multi-word sequences. TRACE reduplicates featural, phonemic, and lexical units at every time step in a potentially very large memory trace, and has rich interconnections (excitatory forward and backward connections between levels and inhibitory links within levels). This leads to an extreme proliferation of units and connections that grows dramatically as the lexicon or the memory trace grows. Our starting point is the observation that models of visual object recognition, including visual word recognition, have long grappled with the fundamental problem of how to model spatial invariance in human object recognition. We introduce a model that combines one aspect of TRACE (time-specific phoneme representations) with higher-level representations that have been used in visual word recognition: spatially (here, temporally) independent diphone and lexical units. This reduces the number of units and connections required by several orders of magnitude relative to TRACE. In this first report, we demonstrate that the model (dubbed TISK, for Time-Invariant String Kernel) achieves reasonable accuracy for the basic TRACE lexicon and successfully simulates the time course of phonological activation and competition. We close with a discussion of phenomena that the model does not yet simulate successfully (and why), and with novel predictions that follow from this architecture.

Keywords: spoken word recognition; time invariance; computational models; TRACE.

Background

Could it be that, despite very salient differences, the auditory and visual systems actually rely on the same mechanisms to recognize words? One signal has a temporal dimension and is carried by transient sound waves; the other is spatially extended and travels at the speed of light. One signal travels sequentially (over time) through the cochlear nerve, the other in parallel through the optic nerve. In their own dedicated primary cortical regions, however, both arrive at spatial representations: tonotopic for the auditory system, retinotopic for the visual system. What happens next, according to computational models of visual and spoken word recognition, further hints at a possible unification.

Modeling spoken and visual word recognition: TRACE and IA

From a psycholinguistic point of view, two early models of word recognition based on the same computational framework have been enormously successful.
In the visual domain, the Interactive Activation (IA) model and its extensions (McClelland & Rumelhart, 1981; Grainger & Jacobs, 1996) can account for a large number of robust and sometimes counterintuitive behavioral findings with a simple and elegant hierarchical structure, in which units at any level compete to represent the stimulus and engage in lobbying up and down the hierarchy. In the auditory domain, TRACE (an extension of the IA framework to speech; McClelland & Elman, 1986) continues to produce new insights into human behavior, including close fits to fine-grained estimates of the time course of spoken word recognition from the visual world paradigm (Allopenna et al., 1998; Dahan, Magnuson, Tanenhaus, & Hogan, 2001; Dahan, Magnuson, & Tanenhaus, 2001).¹

¹ One probably superficial difference between the two models is that between-level connections in IA models of reading typically include both inhibitory and excitatory connections, whereas between-level connections in TRACE are excitatory only.

It is important to note that current, psychologically tractable models of spoken word recognition do not take real speech as their input. While Grossberg & Myers (2000) have modeled aspects of speech and word processing using real speech inputs, these efforts have not yet yielded a model that can handle both real speech input and a broad range of phenomena in spoken word recognition. In order to address complex issues in word recognition without first solving every fundamental problem in speech perception, TRACE's inputs (for example) are pseudo-spectral acoustic-phonetic features that ramp on and off over time, with temporal overlap between adjacent phonemes providing a coarse analog of coarticulation.
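To make this input scheme concrete, the following minimal sketch lays ramped, temporally overlapping feature vectors onto a feature-by-time grid in the spirit of TRACE's pseudo-spectral inputs. The durations, overlap, and number of feature dimensions are illustrative assumptions, not TRACE's actual parameter values.

```python
import numpy as np

def ramp(length):
    """Triangular envelope: activation ramps on, peaks, and ramps off."""
    half = length // 2
    up = np.linspace(0.0, 1.0, half, endpoint=False)
    down = np.linspace(1.0, 0.0, length - half)
    return np.concatenate([up, down])

def feature_input(phoneme_features, dur=12, overlap=6):
    """Lay successive phoneme feature vectors onto a feature-by-time grid.

    phoneme_features: one 1-D feature vector per phoneme.
    dur: time slices each phoneme spans; overlap: slices shared with the
    neighboring phoneme, a coarse analog of coarticulation.
    """
    step = dur - overlap
    n_slices = step * (len(phoneme_features) - 1) + dur
    grid = np.zeros((len(phoneme_features[0]), n_slices))
    for i, feats in enumerate(phoneme_features):
        t0 = i * step
        grid[:, t0:t0 + dur] += np.outer(feats, ramp(dur))
    return grid

# Three hypothetical phonemes over 7 feature dimensions.
phonemes = [np.random.rand(7) for _ in range(3)]
print(feature_input(phonemes).shape)  # (7, 24)
```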
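The abstract's claim that reduplication produces an extreme proliferation of units and connections can be checked with back-of-the-envelope arithmetic. The sketch below contrasts a TRACE-style architecture, which copies every layer at every time slice, with a TISK-style architecture in which only the input phoneme layer is time-specific; all layer sizes are assumptions chosen for illustration, not the models' exact published parameters.

```python
# Illustrative layer sizes (assumed, not the published TRACE/TISK settings).
N_FEATURES = 63   # featural units per time slice
N_PHONEMES = 14   # phoneme inventory
N_WORDS = 212     # approximate size of the basic TRACE lexicon
N_SLICES = 100    # length of the memory trace, in time slices

# TRACE reduplicates feature, phoneme, and word units at every slice.
trace_units = N_SLICES * (N_FEATURES + N_PHONEMES + N_WORDS)

# TISK keeps time-specific units only for phonemes; diphone and word
# units each occur once, independent of position in time.
tisk_units = N_SLICES * N_PHONEMES + N_PHONEMES**2 + N_WORDS

# Within-level lexical inhibition is all-to-all, so connection counts
# scale with the square of the number of word units; reduplication is
# where the gap of several orders of magnitude opens up.
trace_word_inhibition = (N_SLICES * N_WORDS) ** 2
tisk_word_inhibition = N_WORDS ** 2

print(f"units:        TRACE ~{trace_units:,}  TISK ~{tisk_units:,}")
print(f"lexical inh.: TRACE ~{trace_word_inhibition:,}  TISK ~{tisk_word_inhibition:,}")
```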
References

[1] Grainger, J., et al. (2011). Broken symmetries in a location-invariant word recognition network. Neural Computation.
[2] Jones, M. N., et al. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review.
[3] Tanenhaus, M., et al. (2001). Subcategorical mismatches and the time course of lexical access: Evidence for lexical competition.
[4] Allopenna, P. D., et al. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models.
[5] Myers, E. B., et al. (2012). Speaker invariance for phonetic information: An fMRI investigation. Language and Cognitive Processes.
[6] Grainger, J., et al. (2010). Neural networks for word recognition: Is a hidden layer necessary?
[7] Shawe-Taylor, J. (1989). Building symmetries into feedforward networks.
[8] Smola, A., et al. (2007). Kernel methods in machine learning. arXiv: math/0701907.
[9] McClelland, J. L., et al. (1986). The TRACE model of speech perception. Cognitive Psychology.
[10] Grossberg, S., et al. (2000). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review.
[11] Holt, L. L., et al. (2006). Are there interactive processes in speech perception? Trends in Cognitive Sciences.
[12] Chandrasekaran, B., et al. (2011). Neural processing of what and who information in speech. Journal of Cognitive Neuroscience.
[13] Grainger, J., et al. (1996). Orthographic processing in visual word recognition: A multiple read-out model. Psychological Review.
[14] Tanenhaus, M., et al. (2001). Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology.
[15] Lehéricy, S., et al. (2000). The visual word form area: Spatial and temporal characterization of an initial stage of reading in normal subjects and posterior split-brain patients. Brain.
[16] Dehaene, S., et al. (2006). Direct intracranial, fMRI, and lesion evidence for the causal role of left inferotemporal cortex in reading. Neuron.
[17] Cutler, A., et al. (2006). Are there really interactive processes in speech perception? Trends in Cognitive Sciences.
[18] McClelland, J. L., et al. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review.
[19] Gales, M. J. F., et al. (2009). Sequence kernels for speaker and speech recognition.
[20] Holt, L. L., et al. (2006). Response to McQueen et al.: Theoretical and empirical arguments support interactive processing. Trends in Cognitive Sciences.
[21] Grainger, J., et al. (2010). Learning location-invariant orthographic representations for printed words. Connection Science.