Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner

We study continuous speech recognition based on sub-word units found in an unsupervised fashion. For agglutinative languages like Finnish, traditional word-based n-gram language modeling does not work well due to the huge number of different word forms. We use a method based on the Minimum Description Length principle to split words statistically into subword units allowing efficient language modeling and unlimited vocabulary. The perplexity and speech recognition experiments on Finnish speech data show that the resulting model outperforms both word and syllable based trigram models. Compared to the word trigram model, the out-of-vocabulary rate is reduced from 20% to 0% and the word error rate from 56% to 32%.