A computational model of sound recognition used to analyze the capacity and adaptability in learning vowel classes

Sound recognition is likely to begin early in auditory processing and to use stored representations (spectrotemporal templates) that are compared against spectral information from auditory brainstem responses over time. A computational model of sound recognition is developed using neurobiologically plausible operations. The number of templates the model needs to correctly recognize 10 Klatt-synthesized vowels is determined to be around 1250 when it is trained with random fundamental frequencies from the male pitch range and randomized variation of the first three formants of each vowel. To investigate the model's ability to adapt to noise and to previously unheard vowel utterances, test sets of 1000 randomly generated Klatt vowels in babble are generated at signal-to-noise ratios (SNRs) of 20 dB, 10 dB, 5 dB, 0 dB, and −5 dB. The vowel recognition rates at these SNRs are 99.7%, 99.6%, 97.0%, 77.6%, and 54.0%, respectively. In addition, a test set of four vowel recordings from four speakers, presented without noise, gives a 100% recognition rate. These data suggest that storing auditory representations for speech at the spectrotemporal resolution of the auditory nerve, over a typical range of spoken pitch, does not require excessive memory or computing resources to implement on parallel computer systems.
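As an illustration of how a test item at a given SNR can be constructed, the sketch below scales a noise signal so that the signal-to-noise power ratio matches a target value in dB. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, a pure tone stands in for a Klatt-synthesized vowel, and Gaussian noise stands in for babble.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared samples) of a sequence."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(signal, noise, target_snr_db):
    """Scale `noise` so that signal power / noise power equals the target SNR.

    Returns (mixture, scaled_noise). Both inputs must have the same length.
    Hypothetical helper for illustration only.
    """
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_signal == 0 or p_noise == 0:
        raise ValueError("signal and noise must both have non-zero power")
    # SNR_dB = 10*log10(p_signal / (gain^2 * p_noise)); solve for gain.
    gain = math.sqrt(p_signal / (p_noise * 10 ** (target_snr_db / 10)))
    scaled_noise = [gain * v for v in noise]
    mixture = [s + n for s, n in zip(signal, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(signal, noise):
    """SNR in dB computed from the separate signal and noise components."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


if __name__ == "__main__":
    rng = random.Random(0)
    fs = 16000  # assumed sampling rate in Hz
    # Stand-in for a vowel: a 120 Hz tone, 100 ms long (assumption).
    tone = [math.sin(2 * math.pi * 120 * n / fs) for n in range(fs // 10)]
    # Stand-in for babble: white Gaussian noise (assumption).
    noise = [rng.gauss(0.0, 1.0) for _ in tone]
    for target in (20, 10, 5, 0, -5):
        _, scaled = mix_at_snr(tone, noise, target)
        print(target, round(measured_snr_db(tone, scaled), 6))
```

Scaling the noise rather than the signal keeps the vowel level constant across conditions; whether the original study fixed the signal level or the noise level is not stated in the abstract.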
