The accuracy of part-of-speech tagging is critical to downstream sub-tasks in the front-end text analysis model of a text-to-speech system. Uyghur is an agglutinative language in which large numbers of words are formed by attaching suffixes to a stem (or root). Because new inflected and derived word forms can be created almost without limit, conventional Uyghur part-of-speech tagging methods, which train on and predict the part-of-speech of whole words directly, suffer from a large part-of-speech tagging set and frequent out-of-vocabulary words. To address this problem, this paper proposes training on the part-of-speech of stems and predicting the part-of-speech of a word mainly from its stem. A bigram language model is used to segment each word at its stem-affix boundary, and a hidden Markov model is used to train and predict the part-of-speech of the stem. Finally, a rule-based adjustment step corrects the part-of-speech of words whose category changes when a suffix attaches to the stem. Experimental results show that the proposed method clearly reduces the part-of-speech tagging error rate compared with the conventional part-of-speech tagging method.
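The three stages described above (bigram-based stem-affix segmentation, hidden Markov model tagging of stems, and rule-based adjustment) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two-tag set (N, V), the toy probabilities, the transliterated example words, the single adjustment rule, and the function names `segment`, `viterbi`, and `tag_words`.

```python
import math

# All tables below are toy, hypothetical values chosen for illustration only.
STEM_P = {"kitab": 0.5, "oqu": 0.5}                      # P(stem)
SUFFIX_P = {("kitab", "lar"): 0.5, ("kitab", ""): 0.5,   # P(suffix | stem)
            ("oqu", "di"): 0.5, ("oqu", ""): 0.3, ("oqu", "ghuchi"): 0.2}
STATES = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
EMIT = {"N": {"kitab": 0.9, "oqu": 0.1}, "V": {"kitab": 0.1, "oqu": 0.9}}
# (stem tag, suffix) -> tag of the whole word, for category-changing suffixes.
SUFFIX_RULES = {("V", "ghuchi"): "N"}
FLOOR = 1e-9  # probability assigned to unseen events


def segment(word):
    """Pick the stem/suffix split with the highest bigram score."""
    def score(stem, suffix):
        return (math.log(STEM_P.get(stem, FLOOR))
                + math.log(SUFFIX_P.get((stem, suffix), FLOOR)))

    best, best_score = (word, ""), score(word, "")  # default: no suffix
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        s = score(stem, suffix)
        if s > best_score:
            best, best_score = (stem, suffix), s
    return best


def viterbi(obs):
    """Return the most probable tag sequence for a list of stems."""
    def e(state, o):
        return math.log(EMIT[state].get(o, FLOOR))

    scores = [{s: math.log(START[s]) + e(s, obs[0]) for s in STATES}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = scores[-1], {}, {}
        for s in STATES:
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            col[s] = prev[p] + math.log(TRANS[p][s]) + e(s, o)
            ptr[s] = p
        scores.append(col)
        back.append(ptr)
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):  # follow back-pointers to the first stem
        path.append(ptr[path[-1]])
    return path[::-1]


def tag_words(words):
    """Segment each word, tag the stems, then apply suffix rules."""
    pairs = [segment(w) for w in words]
    stem_tags = viterbi([stem for stem, _ in pairs])
    return [SUFFIX_RULES.get((tag, suffix), tag)
            for tag, (_, suffix) in zip(stem_tags, pairs)]


print(tag_words(["oqughuchi", "kitablar", "oqudi"]))  # ['N', 'N', 'V']
```

In this toy run the stem "oqu" is tagged V by the hidden Markov model in both positions, and the adjustment rule then changes the first word to N because of its suffix. Because the model only needs entries for stems, a word form not seen in training can still be tagged as long as its stem is known.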