An Adaptive Psychoacoustic Model for Automatic Speech Recognition

Compared with automatic speech recognition (ASR) systems, the human auditory system copes far better with adverse conditions such as environmental noise and channel distortion. To mimic this ability, auditory models have been widely incorporated into ASR systems to improve their robustness. This paper proposes a novel auditory model that incorporates psychoacoustics and otoacoustic emissions (OAEs) into ASR. In particular, we implement the frequency-dependent property of psychoacoustic models, which improves the resulting system performance. We also present a novel double-transform spectrum-analysis technique that can qualitatively predict ASR performance for different noise types. A detailed theoretical analysis is provided to show the effectiveness of the proposed algorithm. Experiments on the AURORA2 database show that the proposed feature extraction method significantly increases the word recognition rate over the baseline. With models trained on clean speech, the proposed method achieves up to 85.39% word recognition accuracy on noisy data.
