Language model accuracy and uncertainty in noise canceling in the stochastic weighted Viterbi algorithm

In this paper, Stochastic Weighted Viterbi (SWV) decoding is combined with language modeling, which in turn guides the Viterbi decoding in those intervals where the information provided by noisy frames is not reliable. In other words, knowledge from higher layers (e.g. the language model) compensates for the low accuracy of the information provided by the acoustic-phonetic modeling wherever the original clean speech signal cannot be reliably estimated. Bigram and trigram language models are tested, and in combination with spectral subtraction, the SWV algorithm can reduce the word error rate (WER) by as much as 20% or 45%, using only a rough estimate of the additive noise made over a short non-speech interval. The results presented here also suggest that the higher the language model accuracy, the greater the improvement due to SWV. This paper proposes that the problem of noise robustness in speech recognition should be addressed in two different contexts: first, at the acoustic-phonetic level only, as in small-vocabulary tasks with a flat language model; and second, by integrating noise canceling with information from higher layers.
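To make the idea concrete, the following is a minimal sketch (not the paper's actual implementation) of a Viterbi pass in which each frame's acoustic log-likelihood is scaled by a per-frame reliability weight, so that transition (language-model) scores dominate where the noisy estimate is poor. All names (`weighted_viterbi`, the toy two-state data) are hypothetical, and real SWV uses an uncertainty-derived weighting rather than a hand-set one.

```python
import math

def weighted_viterbi(obs_loglik, reliability, bigram, init):
    """Toy reliability-weighted Viterbi over a small state set.

    obs_loglik[t][s] : acoustic log-likelihood of state s at frame t
    reliability[t]   : weight in [0, 1]; low for unreliable (noisy) frames
    bigram[p][s]     : language-model log-probability of transition p -> s
    init[s]          : initial log-probability of state s
    """
    n = len(init)
    # Acoustic evidence is down-weighted by the frame's reliability.
    delta = [init[s] + reliability[0] * obs_loglik[0][s] for s in range(n)]
    path = [[s] for s in range(n)]
    for t in range(1, len(obs_loglik)):
        new_delta, new_path = [], []
        for s in range(n):
            best = max(range(n), key=lambda p: delta[p] + bigram[p][s])
            score = (delta[best] + bigram[best][s]
                     + reliability[t] * obs_loglik[t][s])
            new_delta.append(score)
            new_path.append(path[best] + [s])
        delta, path = new_delta, new_path
    best_final = max(range(n), key=lambda s: delta[s])
    return path[best_final], delta[best_final]

# Toy two-state example: frame 1 acoustically favors state 1,
# but the bigram model favors staying in state 0.
obs = [[math.log(0.9), math.log(0.1)],
       [math.log(0.2), math.log(0.8)]]
bigram = [[math.log(0.7), math.log(0.3)],
          [math.log(0.4), math.log(0.6)]]
init = [math.log(0.5), math.log(0.5)]

# Frame 1 judged unreliable: its acoustic score is ignored,
# and the language model decides the transition.
path_noisy, _ = weighted_viterbi(obs, [1.0, 0.0], bigram, init)
# Both frames judged reliable: the acoustic evidence prevails.
path_clean, _ = weighted_viterbi(obs, [1.0, 1.0], bigram, init)
```

With the frame marked unreliable the decoder follows the bigram and stays in state 0; with full reliability the acoustic score pulls it to state 1, illustrating how the weighting lets higher-layer knowledge take over in noisy intervals.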