A Neural Network System for Large-Vocabulary Continuous Speech Recognition in Variable Acoustic Environments

The performance of speech recognizers is typically degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more pronounced as the microphone is positioned farther from the speaker, for instance in a teleconferencing application. Mismatches between training and testing conditions, for example in frequency response, microphone, signal-to-noise ratio (SNR), and room reverberation, also degrade recognition performance. Among the available approaches to handling such mismatches, a popular one is to retrain the speech recognizer for the new environment. Hidden Markov models (HMMs) have to date been accepted as an effective classification method for large-vocabulary continuous speech recognition, as in the ARPA-sponsored SPHINX and DECIPHER systems. However, retraining HMM-based recognizers is a complex and tedious task. It requires re-collecting speech data under the corresponding conditions and re-estimating the HMM parameters. Particularly great time and effort are needed to retrain a recognizer that operates in a speaker-independent mode, which is the mode of greatest general interest.