A layered approach for dutch large vocabulary continuous speech recognition

In this paper we investigate whether a layered architecture that has already proven its value for small tasks, works for a system with large lexica (400k words) and language models (5-grams) as well. The architecture was designed to decouple phone and word recognition which allows for the integration of more complex linguistic components, especially at the sub-word level. It was tested on the Dutch language which - with its large variety of accents and rich morphology - is ideally suited to benefit from this integration. The results reveal that the architecture is already competitive to an all-in-one approach in which acoustic models, language models and lexicon are all applied simultaneously. Candidates for further improvement to the system based on a conditional phone confusion model are suggested.