Paper information - Slovak Language Model from Internet Text Data

Slovak Language Model from Internet Text Data

An automatic speech recognition system is one component of a multimodal dialogue system, and it requires a correct vocabulary and a suitable language model. This article describes the process of building large-vocabulary statistical models of the Slovak language, trained on text data gathered mainly from Internet sources. Several smoothing techniques were applied at different vocabulary sizes in order to obtain an optimal model of the Slovak language. We also used a pruning technique based on relative entropy to reduce the size of the language model, looking for the maximum pruning threshold that causes minimal degradation in recognition accuracy. Tests were performed with a decoder based on the HTK Toolkit.
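The abstract does not say which smoothing techniques were compared, so the following is only a minimal sketch of the general idea: a bigram model with interpolated absolute discounting, evaluated by perplexity. The function names (`train_bigram`, `perplexity`), the discount value 0.75, the add-one unigram fallback, and the toy corpus are all assumptions made for illustration. They are not the authors' setup, which according to the abstract relied on a decoder based on the HTK Toolkit.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sentences, discount=0.75):
    """Train a bigram model with interpolated absolute discounting.

    Returns (prob, vocab), where prob(word, prev) estimates P(word | prev)
    over the closed vocabulary seen in training.
    """
    unigrams = Counter()            # counts of predicted tokens
    bigrams = Counter()             # counts of (prev, word) pairs
    followers = defaultdict(set)    # distinct words seen after each history
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            followers[prev].add(word)

    history_counts = Counter()
    for (prev, _), count in bigrams.items():
        history_counts[prev] += count

    total = sum(unigrams.values())
    vocab = set(unigrams)

    def p_unigram(word):
        # Add-one smoothed unigram estimate (illustrative choice).
        return (unigrams[word] + 1) / (total + len(vocab))

    def prob(word, prev):
        h = history_counts[prev]
        if h == 0:
            # Unseen history: fall back to the unigram estimate.
            return p_unigram(word)
        discounted = max(bigrams[(prev, word)] - discount, 0) / h
        # Probability mass removed by discounting, redistributed via unigrams.
        backoff_mass = discount * len(followers[prev]) / h
        return discounted + backoff_mass * p_unigram(word)

    return prob, vocab


def perplexity(prob, sentences):
    """Per-token perplexity of the model on the given sentences."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(word, prev))
            n += 1
    return math.exp(-log_sum / n)


if __name__ == "__main__":
    # Hypothetical toy corpus; a real model would use a large text collection.
    model, vocabulary = train_bigram(["a b a", "a b c"])
    print(perplexity(model, ["a b c"]))
```

Comparing smoothing methods then amounts to swapping the estimate inside `prob` and measuring perplexity on held-out text. Entropy-based pruning would additionally drop explicit bigram entries whose removal changes the model's relative entropy by less than a chosen threshold.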

Matúš Pleva | Jozef Juhár | Daniel Hládek | Ján Staš
