Automatic speech recognition with an adaptation model motivated by auditory processing

The mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) feature extraction methods typically used for automatic speech recognition (ASR) employ several principles that have known counterparts in the cochlea and auditory nerve: frequency decomposition, mel- or bark-warping of the frequency axis, and compression of amplitudes. It is natural to ask whether a counterpart of the next physiological processing step, synaptic adaptation, can also be employed profitably. We therefore incorporated a simplified model of short-term adaptation into MFCC feature extraction. We evaluated the resulting ASR performance on the AURORA 2 and AURORA 3 tasks against ordinary MFCCs, MFCCs processed by RASTA, and MFCCs processed by cepstral mean subtraction (CMS), and both against and in combination with Wiener filtering. The results suggest that our approach offers a simple, causal robustness strategy that is competitive with RASTA, CMS, and Wiener filtering and performs well in combination with Wiener filtering. Compared to the structurally related RASTA, our adaptation model provides superior performance on AURORA 2 and, if Wiener filtering is applied before both approaches, on AURORA 3 as well.
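
The abstract does not specify the adaptation model, so the following is only an illustrative sketch of what a causal, adaptation-like stage on log-mel energies could look like, not the authors' method. It assumes a first-order leaky average per mel channel that is subtracted from the current frame; the function names `adapt_log_mel` and `cms`, the time constant `tau_frames`, and the first-frame initialisation are all hypothetical choices made for this example. A non-causal CMS is included for contrast.

```python
import numpy as np

def adapt_log_mel(log_mel, tau_frames=25.0):
    """Causal adaptation sketch (assumed form, not the paper's model).

    log_mel: array of shape (T, C), log-compressed mel energies.
    tau_frames: hypothetical adaptation time constant, in frames.
    Each channel's output is the current value minus a leaky running
    average of its own past, so onsets are emphasised and steady-state
    levels decay toward zero.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    alpha = 1.0 / tau_frames          # update rate of the running average
    state = log_mel[0].copy()         # assumed initialisation: first frame
    out = np.empty_like(log_mel)
    for t, frame in enumerate(log_mel):
        out[t] = frame - state        # uses past frames only (causal)
        state += alpha * (frame - state)
    return out

def cms(features):
    """Mean subtraction over the whole utterance (non-causal)."""
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)
```

In this sketch a constant per-channel offset in the log-spectral domain, which is how a stationary channel distortion appears, cancels in the output, much as it does under CMS, but without needing the whole utterance in advance.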
