Topic Detection for Language Model Adaptation of Highly-Inflected Languages by Using a Fuzzy Comparison Function

A new framework is proposed for constructing corpus-based, topic-adapted language models for large-vocabulary speech recognition of the highly-inflected Slovenian language. The proposed techniques can be applied to other Slavic languages, in which words are formed with many different inflectional affixes. This article describes an attempt to overcome two important difficulties of highly-inflected languages: a high out-of-vocabulary rate and the problem of topic detection. The first problem is addressed by decomposing words into stems and endings, and topic detection is improved by a novel approach to feature extraction based on soft comparison of words. The results of experiments on the second largest Slovenian newspaper news