Statistical language models for large vocabulary spontaneous speech recognition in Dutch

State-of-the-art large vocabulary automatic speech recognition systems use a large statistical language model, typically an N-gram. Estimating this model, however, requires a large database of sentences or texts in the same style as the recognition task. For spontaneous speech, no such database is available, since it would have to consist of accurate, and therefore expensive, orthographic transcriptions of spoken audio. This paper investigates how readily available large newspaper corpora can be used to improve language models for spontaneous speech recognition, even though the two language styles differ considerably. A technique is proposed that automatically selects appropriate newspaper articles based on perplexity and subsequently uses these texts in the language model estimation. Recognition experiments on spontaneous broadcast speech in Dutch showed significant improvements using this technique.
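
The perplexity-based selection step can be illustrated with a minimal sketch. It is not the system described in this paper: it assumes a small seed set of tokenised spontaneous-speech transcriptions, an add-one smoothed unigram model in place of a full N-gram, and a hand-picked perplexity threshold. The function names (`train_unigram`, `perplexity`, `select_articles`) and the toy data are hypothetical and serve only to show the idea of keeping the newspaper articles that the in-domain model finds least surprising.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count words in tokenised in-domain (spontaneous speech) sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return counts, sum(counts.values())


def perplexity(words, counts, total, vocab_size):
    """Perplexity of a word sequence under an add-one smoothed unigram model."""
    log_prob = 0.0
    for word in words:
        # Add-one smoothing: unseen words get a count of 1 (assumption,
        # chosen for simplicity; the abstract does not name a smoothing method).
        prob = (counts[word] + 1) / (total + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(words))


def select_articles(articles, in_domain_sentences, threshold):
    """Keep the articles whose perplexity falls below the threshold."""
    counts, total = train_unigram(in_domain_sentences)
    # One extra vocabulary slot is reserved for all unseen words.
    vocab_size = len(counts) + 1
    selected = []
    for article in articles:
        # Empty articles are skipped to avoid dividing by zero.
        if article and perplexity(article, counts, total, vocab_size) < threshold:
            selected.append(article)
    return selected


# Toy data, invented for illustration only.
seed = [["uh", "i", "think", "so"], ["i", "think", "yes"]]
speech_like = ["i", "think", "so"]
news_like = ["the", "minister", "announced"]
kept = select_articles([speech_like, news_like], seed, threshold=10.0)
```

In this toy case only `speech_like` is kept, because every word of `news_like` is unseen in the seed data. The selected articles would then be added to the training text for the final language model; how they are combined with the in-domain data (for example, pooling or interpolation) is a separate design choice that this sketch leaves open.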