Tibetan Word Segmentation Based on Word-Position Tagging

The main advantage of Tibetan word segmentation based on word-position tagging is that it reduces segmentation errors on unknown words. In this article, the authors extend the usual 4-tag set to a 6-tag set to fit the features of Tibetan characters, use a conditional random field (CRF) as the tagging model for training and testing on corpus data, and then build post-processing modules to revise the output. The experimental results show that this method performs well and deserves further study, including expanding the corpus and optimizing the tag set and feature templates.
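To illustrate what word-position tagging means in practice, here is a minimal sketch of the usual 4-tag scheme (B = word-initial, M = word-internal, E = word-final, S = single-unit word). The abstract does not list the members of the 6-tag set, so it is not reproduced here. The function names `words_to_tags` and `tags_to_words` and the placeholder units `u1` to `u6` are assumptions made for illustration only; the CRF tagger and the post-processing modules are not modeled.

```python
from typing import List, Tuple

def words_to_tags(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Convert segmented words (each a list of units) into (unit, tag) pairs.

    Uses the common 4-tag scheme B/M/E/S, not the paper's 6-tag set.
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            # A word made of a single unit gets the S tag.
            tagged.append((word[0], "S"))
            continue
        for i, unit in enumerate(word):
            if i == 0:
                tag = "B"          # first unit of a multi-unit word
            elif i == len(word) - 1:
                tag = "E"          # last unit of a multi-unit word
            else:
                tag = "M"          # any unit in between
            tagged.append((unit, tag))
    return tagged

def tags_to_words(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild words from a tag sequence, as done after a tagger has run."""
    words, current = [], []
    for unit, tag in tagged:
        # B or S starts a new word; flush anything left unfinished.
        if tag in ("B", "S") and current:
            words.append(current)
            current = []
        current.append(unit)
        # E or S closes the current word.
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:
        words.append(current)      # tolerate a sequence that ends mid-word
    return words

# Placeholder units stand in for Tibetan syllables or characters.
example = [["u1"], ["u2", "u3"], ["u4", "u5", "u6"]]
tagged_example = words_to_tags(example)
restored_example = tags_to_words(tagged_example)
```

Because segmentation is recovered purely from the predicted tags, a tagger trained this way can propose boundaries for words it has never seen, which is the property the abstract credits for fewer unknown-word errors.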
