In this paper, we introduce a new approach to Vietnamese word segmentation. We recast the word segmentation problem as a morpho-syllable position-in-word (PIW) tagging problem. We train a Maximum Entropy model with Generalized Iterative Scaling (GIS) on annotated corpora and use the trained model to tag every morpho-syllable of an input sentence. The tagged output is then converted into a segmented sentence for evaluation. Results on a number of tagged corpora show that this approach is suitable for Vietnamese word segmentation: it achieves a precision of 94.87%, a recall of 94.08%, and an F-measure of 94.44%.

Index Terms: word segmentation, maximum entropy
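To illustrate how segmentation can be restated as PIW tagging, the following is a minimal sketch of the two conversions involved: from a segmented sentence to one tag per morpho-syllable, and from a tagged sentence back to words. The four tag names (LL, MM, RR, LR) and the function names `words_to_tags` and `tags_to_words` are assumptions made for this illustration, in the style of character-tagging work on Chinese; they are not necessarily the tag set used in this paper, and the Maximum Entropy tagger itself is not shown.

```python
from typing import List

# Hypothetical PIW tag set (an assumption, not taken from the paper):
#   LL = leftmost syllable of a multi-syllable word
#   MM = middle syllable of a multi-syllable word
#   RR = rightmost syllable of a multi-syllable word
#   LR = a syllable that forms a word on its own


def words_to_tags(words: List[List[str]]) -> List[str]:
    """Turn a segmented sentence (a list of words, each a list of
    syllables) into one PIW tag per syllable."""
    tags: List[str] = []
    for word in words:
        n = len(word)
        if n == 1:
            tags.append("LR")
        else:
            tags.extend(["LL"] + ["MM"] * (n - 2) + ["RR"])
    return tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild the segmentation from syllables and their PIW tags.
    Inconsistent tag sequences are tolerated: a word is closed whenever
    a new word starts or the current one ends."""
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        if tag in ("LL", "LR") and current:
            # A new word starts, so close any unfinished word first.
            words.append(current)
            current = []
        current.append(syllable)
        if tag in ("RR", "LR"):
            # This syllable ends the current word.
            words.append(current)
            current = []
    if current:
        words.append(current)
    return words


# "học sinh học sinh học": [học sinh] [học] [sinh học]
segmented = [["học", "sinh"], ["học"], ["sinh", "học"]]
syllables = [s for word in segmented for s in word]
tags = words_to_tags(segmented)
restored = tags_to_words(syllables, tags)
```

In this sketch, the same five syllables receive different tags depending on their position in a word, which is the information a trained tagger would have to predict before the sentence can be converted back into words.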