Speaker normalization using cortical strip maps: a neural model for steady-state vowel categorization.

Auditory signals of speech are speaker dependent, but representations of language meaning are speaker independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which are input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by adaptive resonance theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [Peterson, G. E., and Barney, H. L., J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.
