A Maximum Entropy Approach for Vietnamese Word Segmentation

In this paper, we introduce a new approach to Vietnamese word segmentation. We restate word segmentation as a morpho-syllable position-in-word (PIW) tagging problem. A Maximum Entropy model is trained on annotated corpora using Generalized Iterative Scaling (GIS). The trained model then tags every morpho-syllable of an input sentence, and the tagged output is converted into a segmented sentence for evaluation. Results on multiple tagged corpora show that this approach is suitable for Vietnamese word segmentation: it achieves a precision of 94.87%, a recall of 94.08%, and an F-measure of 94.44%.

Index Terms: word segmentation, maximum entropy
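
The reformulation of segmentation as PIW tagging, and the conversion of a tagged sentence back into words for evaluation, can be illustrated with a minimal sketch. It assumes a two-tag scheme (`B` for the first morpho-syllable of a word, `I` for any later one) and word-level scoring by exact span match; the abstract does not specify the tag set or the scoring procedure, so both are assumptions, and every function name below is hypothetical. The Maximum Entropy tagger itself is not shown.

```python
from typing import List, Set, Tuple

Word = List[str]  # a word is a list of morpho-syllables


def words_to_piw(words: List[Word]) -> List[Tuple[str, str]]:
    """Flatten a segmented sentence into (syllable, tag) pairs.

    Assumed tag set: "B" = first syllable of a word, "I" = any later syllable.
    """
    tagged = []
    for word in words:
        for i, syllable in enumerate(word):
            tagged.append((syllable, "B" if i == 0 else "I"))
    return tagged


def piw_to_words(tagged: List[Tuple[str, str]]) -> List[Word]:
    """Rebuild the segmentation from (syllable, tag) pairs."""
    words: List[Word] = []
    for syllable, tag in tagged:
        # An "I" with no open word is treated as a word start.
        if tag == "B" or not words:
            words.append([syllable])
        else:
            words[-1].append(syllable)
    return words


def word_spans(words: List[Word]) -> Set[Tuple[int, int]]:
    """Represent each word as a (start, end) span over syllable positions."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall_f(gold: List[Word], pred: List[Word]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-measure by exact span match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# "học sinh | học | sinh học" (students study biology)
gold = [["học", "sinh"], ["học"], ["sinh", "học"]]
# A hypothetical wrong segmentation: "học sinh | học sinh | học"
pred = [["học", "sinh"], ["học", "sinh"], ["học"]]

tags = words_to_piw(gold)
scores = precision_recall_f(gold, pred)
```

Here `tags` pairs the five morpho-syllables with the tags `B I B B I`, and `piw_to_words(tags)` recovers `gold`. Only the first word of `pred` matches a gold span, so precision, recall, and F-measure all equal 1/3 for this pair.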