Paper information - Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner

We study continuous speech recognition based on subword units found in an unsupervised fashion. For agglutinative languages like Finnish, traditional word-based n-gram language modeling does not work well because of the huge number of different word forms. We use a method based on the Minimum Description Length principle to split words statistically into subword units, which allows efficient language modeling and an unlimited vocabulary. Perplexity and speech recognition experiments on Finnish speech data show that the resulting model outperforms both word-based and syllable-based trigram models. Compared to the word trigram model, the out-of-vocabulary rate is reduced from 20% to 0% and the word error rate from 56% to 32%.
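The effect on the out-of-vocabulary rate can be illustrated with a minimal sketch. It is not the authors' method: the unit inventory below is written by hand rather than learned with the Minimum Description Length criterion, and the toy words, the function names (`oov_rate`, `can_segment`, `subword_oov_rate`) and the data are assumptions made up for this illustration. The only figures taken from the abstract are the two word error rates used in the last line.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are missing from a closed word vocabulary."""
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)


def can_segment(word, units):
    """True if `word` can be written as a concatenation of items in `units`."""
    # reachable[i] is True when the prefix word[:i] can be built from units.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in units
            for start in range(end)
        )
    return reachable[-1]


def subword_oov_rate(tokens, units):
    """Fraction of tokens that cannot be built from the subword units."""
    missing = sum(1 for t in tokens if not can_segment(t, units))
    return missing / len(tokens)


# Toy Finnish-like data, invented for illustration only (hypothetical).
word_vocab = {"talo", "talossa", "auto"}
units = {"talo", "auto", "ssa", "ni", "lla"}  # hand-picked, not learned
test_tokens = ["talossa", "autossa", "talossani", "autolla", "auto"]

word_oov = oov_rate(test_tokens, word_vocab)
morph_oov = subword_oov_rate(test_tokens, units)

# Relative word error rate reduction implied by the abstract (56% to 32%).
relative_wer_reduction = (0.56 - 0.32) / 0.56
```

In this toy setting the word vocabulary misses three of the five test tokens, while every token can be assembled from the five units. The same mechanism, with units learned from data, is what lets a subword language model cover inflected forms that never appeared as whole words in training.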

Mikko Kurimo | Mathias Creutz | Teemu Hirsimäki | Vesa Siivola

[1] Joshua Goodman, et al. A bit of progress in language modeling, 2001, Comput. Speech Lang.

[2] Xavier L. Aubert, et al. An overview of decoding techniques for large vocabulary continuous speech recognition, 2002, Comput. Speech Lang.

[3] Mark Huckvale, et al. Using phonologically-constrained morphological analysis in continuous speech recognition, 2002, Comput. Speech Lang.

[4] Mathias Creutz, et al. Unsupervised Discovery of Morphemes, 2002, SIGMORPHON.

[5] Ronald Rosenfeld, et al. Statistical language modeling using the CMU-Cambridge toolkit, 1997, EUROSPEECH.

[6] Jorma Rissanen, et al. Stochastic Complexity in Statistical Inquiry, 1989, World Scientific Series in Computer Science.

[7] Dietrich Klakow, et al. Speech recognition for huge vocabularies by using optimized sub-word units, 2001, INTERSPEECH.

[8] William J. Byrne, et al. On large vocabulary continuous speech recognition of highly inflectional language - Czech, 2001, INTERSPEECH.