Paper information - ANALYSIS OF MORPH-BASED LANGUAGE MODELING AND SPEECH RECOGNITION IN SLOVAK

Slovak is a highly inflected language with a large number of unique word forms, which leads not only to a large vocabulary but also to many out-of-vocabulary words. Morph-based language models address this problem by decomposing inflected word forms into smaller sub-word units, which also mitigates the general problem of sparsity in the training data. In this paper, we present several rule-based and data-driven approaches to the automatic segmentation of words into morphs. The segmented data are then used to model the Slovak language for large vocabulary continuous speech recognition. Preliminary results show a significant decrease in the number of out-of-vocabulary words and a reduction in the perplexity of the resulting language model.
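The effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' segmentation method: the suffix list, the `+` marker for non-final morphs, the function names, and the toy word lists are all assumptions made for illustration. The sketch compares the out-of-vocabulary rate of a tiny test set when the vocabulary consists of whole words and when it consists of morphs produced by a naive rule-based stem/ending split.

```python
# Hypothetical ending list, chosen only for this toy example;
# it is not taken from the paper.
SUFFIXES = ("ami", "ov", "om", "a", "u", "y")


def segment(word, suffixes=SUFFIXES, min_stem=3):
    """Split a word into stem and ending by longest matching suffix.

    A trailing '+' marks a non-final morph so that words can be
    rejoined after recognition (an assumed convention).
    """
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return [word[:-len(suf)] + "+", suf]
    return [word]


def to_morphs(words):
    """Flatten a list of words into a list of morph tokens."""
    return [morph for word in words for morph in segment(word)]


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


# Toy data: inflected forms of two Slovak nouns (illustrative only).
train_words = ["hrad", "hradu", "hradom", "stromy", "stromov"]
test_words = ["hrady", "hradov", "stromu", "stromom"]

word_oov = oov_rate(train_words, test_words)
morph_oov = oov_rate(to_morphs(train_words), to_morphs(test_words))

print(f"word-level OOV rate:  {word_oov:.2f}")
print(f"morph-level OOV rate: {morph_oov:.2f}")
```

In this toy setting every test word form is unseen at the word level, while all of its stems and endings already occur in the training data, so the morph-level OOV rate drops to zero. Real corpora would show a much less extreme difference.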

Jozef Juhar | Daniel Hladek | Jan Stas | Daniel Zlacky
