In this paper, Stochastic Weighted Viterbi (SWV) decoding is combined with language modeling, which in turn guides the Viterbi decoding in those intervals where the information provided by noisy frames is not reliable. In other words, knowledge from higher layers (e.g. the language model) compensates for the low accuracy of the acoustic-phonetic modeling where the original clean speech signal cannot be reliably estimated. Bigram and trigram language models are tested, and in combination with spectral subtraction the SWV algorithm can reduce word error rate (WER) by as much as 20% or 45%, using a rough estimate of the additive noise made over a short non-speech interval. The results presented here also suggest that the higher the language model accuracy, the greater the improvement due to SWV. This paper proposes that the problem of noise robustness in speech recognition should be classified into two different contexts: first, at the acoustic-phonetic level only, as in small-vocabulary tasks with a flat language model; and second, by integrating noise canceling with information from higher layers.
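The core idea above (letting transition/language-level information dominate on frames where the acoustic evidence is unreliable) can be illustrated with a minimal weighted-Viterbi sketch. This is not the paper's implementation; it is a generic HMM decoder, with hypothetical inputs, in which each frame's acoustic log-likelihood is scaled by a reliability weight in [0, 1] (weight 0 means the frame's acoustic evidence is ignored and the decoded path is driven by the transition model alone):

```python
import numpy as np

def weighted_viterbi(log_A, log_b, log_pi, weights):
    """Viterbi decoding with per-frame reliability weights.

    log_A:   (S, S) log transition probabilities (higher-layer knowledge)
    log_b:   (T, S) per-frame log observation likelihoods (acoustic evidence)
    log_pi:  (S,)   log initial state probabilities
    weights: (T,)   frame reliability in [0, 1]; 0 = ignore acoustic evidence
    Returns the maximum-score state sequence as a list of state indices.
    """
    T, S = log_b.shape
    # Acoustic score of each frame is scaled by its reliability weight.
    delta = log_pi + weights[0] * log_b[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)           # best predecessor per state
        delta = scores.max(axis=0) + weights[t] * log_b[t]
    # Backtrack the best path.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With a middle frame whose (noise-corrupted) likelihoods favor the wrong state, setting that frame's weight to 0 lets the transition probabilities keep the decoder on the state supported by the surrounding reliable frames, mirroring how SWV lets the language model take over in unreliable intervals.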