Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model reduced perplexity by up to 13% compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
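
To make the interpolation and perplexity measurements concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common class-based factorization P(w | h) = P(c(w) | c(h)) * P(w | c(w)), which the abstract does not spell out, and it uses made-up per-word probabilities. The function names and the interpolation weight `lam` are hypothetical.

```python
import math


def class_model_prob(p_class_given_history, p_word_given_class):
    """Class-based n-gram probability under the assumed factorization
    P(w | h) = P(c(w) | c(h)) * P(w | c(w))."""
    return p_class_given_history * p_word_given_class


def interpolate(p_word_model, p_class_model, lam):
    """Linearly interpolate two model probabilities for the same word.

    lam is the weight of the word n-gram model; (1 - lam) goes to the
    class-based model. In practice lam would be tuned on held-out data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word_model + (1.0 - lam) * p_class_model


def perplexity(word_probs):
    """Perplexity of a test sequence: exp of the negative mean log-probability."""
    if not word_probs:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Toy, invented probabilities for three test words (illustration only).
word_model = [0.1, 0.001, 0.2]    # sparse word 3-gram: one rare event
class_model = [0.05, 0.02, 0.1]   # smoother class-based estimates
mixed = [interpolate(pw, pc, lam=0.5) for pw, pc in zip(word_model, class_model)]

ppl_word = perplexity(word_model)
ppl_mixed = perplexity(mixed)
relative_reduction = (ppl_word - ppl_mixed) / ppl_word
```

On this toy data the interpolated model has lower perplexity than the word model alone because the class-based estimate lifts the probability of the rare event. The figures reported above come from the 85-million-word corpus, not from numbers like these.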
