In this article we address the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems lack word boundary delimiters. While the scarcity of text resources already degrades language model performance, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the available text, we propose a method based on weighted finite-state transducers that estimates the N-gram language model from a training corpus in which each sentence is segmented in multiple ways rather than with a single segmentation. Multiple segmentation extracts more N-grams from the training corpus, including N-grams that a unique segmentation would miss. We apply this approach to train language models for Khmer and Vietnamese automatic speech recognition systems, where multiple segmentations yield better performance than the unique-segmentation approach.
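The core idea can be illustrated without the WFST machinery: given an unsegmented string and a word list, enumerate every in-vocabulary segmentation and accumulate N-gram counts over all of them, so that N-grams absent from any single segmentation still receive counts. The sketch below is a minimal illustration with a hypothetical toy lexicon, not the paper's actual WFST-based implementation.

```python
from collections import Counter

def segmentations(text, vocab):
    # Enumerate every way to split `text` into in-vocabulary words.
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        word = text[:i]
        if word in vocab:
            for rest in segmentations(text[i:], vocab):
                yield [word] + rest

def bigram_counts(text, vocab):
    # Accumulate bigram counts over ALL segmentations of the sentence,
    # not just one, so bigrams a unique segmentation would miss
    # still contribute to the language model estimate.
    counts = Counter()
    for seg in segmentations(text, vocab):
        padded = ["<s>"] + seg + ["</s>"]
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
    return counts

# Toy example: "abc" has three segmentations under this lexicon:
# [ab, c], [a, b, c], and [a, bc].
vocab = {"a", "b", "c", "ab", "bc"}
counts = bigram_counts("abc", vocab)
```

In a real system the segmentation lattice would be represented as a weighted finite-state transducer and the counts could be weighted per path; here every segmentation contributes equally.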