Improving A Simple Bigram HMM Part-of-Speech Tagger by Latent Annotation and Self-Training

In this paper, we describe and evaluate a bigram part-of-speech (POS) tagger that uses latent annotations, and we then investigate self-training the tagger on additional genre-matched unlabeled data. The use of latent annotations substantially improves the performance of a baseline bigram HMM tagger, allowing it to outperform a trigram HMM tagger with sophisticated smoothing. Self-training with a large set of unlabeled data further enhances the performance of the latent-annotation tagger, even in situations where standard bigram and trigram taggers trained on larger amounts of labeled data do not benefit from self-training. Our best model obtains a state-of-the-art Chinese tagging accuracy of 94.78% when evaluated on a representative test set of the Penn Chinese Treebank 6.0.
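For context, a bigram HMM tagger factors the joint probability of a word sequence w and tag sequence t into tag-transition and word-emission terms; under latent annotation, each treebank tag t_i is split into latent subtags t_i[x_i] (the number of splits per tag is a modeling choice not specified in this abstract), and the score of an observed tag sequence sums over the latent variables. A sketch of this standard formulation:

\[
P(\mathbf{w}, \mathbf{t}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
\]

\[
P(\mathbf{w}, \mathbf{t}) = \sum_{\mathbf{x}} \prod_{i=1}^{n} P(t_i[x_i] \mid t_{i-1}[x_{i-1}]) \, P(w_i \mid t_i[x_i])
\]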
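Self-training, in its generic form, follows a simple recipe: train on the labeled data, automatically tag the unlabeled data, and retrain on the union. The minimal sketch below illustrates that generic loop only; the function and variable names (self_train, train_fn, tagger.tag) are hypothetical and do not come from the paper.

# Minimal self-training loop (illustrative sketch; names are hypothetical,
# not taken from the paper).
def self_train(train_fn, labeled, unlabeled, rounds=1):
    """Train a tagger, label the unlabeled corpus with it, and retrain on
    the combined data.

    train_fn(sentences) -> tagger with a .tag(words) method
    labeled: list of (words, tags) pairs
    unlabeled: list of word sequences (genre-matched to the test data)
    """
    tagger = train_fn(labeled)
    for _ in range(rounds):
        # Automatically label the unlabeled, genre-matched sentences.
        auto_labeled = [(words, tagger.tag(words)) for words in unlabeled]
        # Retrain on gold-standard plus automatically labeled data.
        tagger = train_fn(labeled + auto_labeled)
    return tagger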