A companding front end for noise-robust automatic speech recognition

Feature computation modules for automatic speech recognition (ASR) systems have long been modeled on the human auditory system. Most current ASR systems model the critical-band response and equal-loudness characteristics of the auditory system. It has been postulated that more detailed models of the human auditory system can lead to more noise-robust speech recognition. An auditory phenomenon of particular relevance to robustness is simultaneous masking, whereby dominant frequencies suppress adjacent weaker frequencies. In this paper, we present a companding-based model that mimics simultaneous masking in the front end of a speech recognizer. In an automotive digit recognition task, the front end reduces word error rate by 4.0% absolute (25% relative to Mel cepstra) at -5 dB SNR, at the cost of a 1.7% increase at 15 dB SNR.
