Paper information - Affix-augmented stem-based language model for Persian

Affix-augmented stem-based language model for Persian

Language modeling is used in many NLP applications, such as machine translation, POS tagging, speech recognition, and information retrieval. A language model assigns a probability to a sequence of words. This task becomes challenging for highly inflectional languages. In this paper, we investigate standard statistical language models on Persian, an inflectional language. We propose two variations of morphological language models that rely on a morphological analyzer to manipulate the dataset before modeling. We then discuss the shortcomings of these models and introduce a novel approach that exploits the structure of the language and produces a more accurate model. Experimental results are encouraging, especially when n-gram models are trained on a small dataset.

Heshaam Faili | Hadi Ravanbakhsh

[2] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[3] Jeff A. Bilmes, et al. Factored Language Models and Generalized Parallel Backoff, 2003, NAACL.

[4] Karine Megerdoomian, et al. Persian Computational Morphology: A Unification-Based Approach, 2000.

[5] Andreas Stolcke, et al. Morphology-based language modeling for conversational Arabic speech recognition, 2006, Comput. Speech Lang.

[6] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[7] A.-M. Derouault, et al. A morphological model for large vocabulary speech recognition, 1990, International Conference on Acoustics, Speech, and Signal Processing.

[8] Andreas Stolcke, et al. Morphology-based language modeling for Arabic speech recognition, 2004, INTERSPEECH.