Multiple text segmentation for statistical language modeling

In this article we deal with the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems lack word boundary delimiters. While the lack of text resources already degrades the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite state transducers to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of a unique segmentation. The multiple segmentation generates more N-grams from the training corpus and yields N-grams not found with a unique segmentation. We use this approach to train the language models of Khmer and Vietnamese automatic speech recognition systems, and the multiple segmentations lead to better performance than the unique segmentation approach.
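To make the idea concrete, the following toy sketch (my own illustration, not the paper's WFST implementation) enumerates every segmentation of an unsegmented string over a small hypothetical vocabulary and collects the bigrams from all of them, showing how multiple segmentations produce N-grams a single segmentation would miss:

```python
from itertools import chain


def segmentations(text, vocab):
    """Enumerate all ways to split `text` into words from `vocab`
    (a naive stand-in for the WFST composition used in the paper)."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        word = text[:i]
        if word in vocab:
            for rest in segmentations(text[i:], vocab):
                yield [word] + rest


def ngrams(words, n):
    """All contiguous N-grams of a word sequence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical vocabulary and unsegmented input.
vocab = {"a", "b", "c", "ab", "bc"}
text = "abc"

segs = list(segmentations(text, vocab))
# -> [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c']]

# Pooling bigrams over all segmentations yields counts a single
# (e.g. most-likely) segmentation could never contribute.
all_bigrams = set(chain.from_iterable(ngrams(s, 2) for s in segs))
# -> {('a', 'b'), ('b', 'c'), ('a', 'bc'), ('ab', 'c')}
```

In the paper's approach the segmentation lattice is represented as a weighted finite state transducer rather than enumerated exhaustively, so N-gram counts can be accumulated (and weighted) efficiently even for long sentences.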