Part-of-speech (POS) tagging, or disambiguation, is a practical problem in the natural language processing (NLP) community, especially in the development of machine translation systems. The performance of a POS tagging system may interfere with the subsequent analytical tasks in the translation process and thereby affect overall translation quality. This paper presents a novel POS tagging system, iTagger, which is based on the Selecting Base Classifiers on Bagging (SBCB) learning algorithm. In this work, the POS tagging task is treated as a classification problem. Features such as the surrounding context of ambiguous candidates, n-gram information, lexical items, and linguistic clues are extracted automatically from the annotated corpus. The proposed system is compared against two state-of-the-art tagging methods, the Hidden Markov Model (HMM) and Maximum Entropy. Empirical results on the (English) Brown corpus, the (Portuguese) Tycho Brahe corpus, and the Chinese Tree Bank corpus show that iTagger is competitive. Moreover, iTagger has been released to the public as a library and tool for various development and application purposes.
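To make the classification framing concrete, the following is a minimal sketch of tagging as per-token classification over context features, combined by bagging. It is an illustration under stated assumptions, not the iTagger implementation: the feature set, the toy base classifier `FeatureVoteClassifier`, the helper names, and the tiny corpus are all hypothetical, and the ensemble uses plain bagging with majority voting rather than the SBCB selection step described in the paper.

```python
import random
from collections import Counter, defaultdict

# Invented toy data for illustration only; not taken from any of the corpora above.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
]


def extract_features(words, i):
    """Hypothetical context features for the token at position i."""
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i < len(words) - 1 else "</s>"
    return (
        "w=" + word.lower(),                                  # lexical item
        "prev=" + prev_word.lower(),                          # left context
        "next=" + next_word.lower(),                          # right context
        "bigram=" + prev_word.lower() + "_" + word.lower(),   # n-gram information
        "suffix2=" + word[-2:].lower(),                       # simple linguistic clue
        "is_capitalized=" + str(word[:1].isupper()),
    )


def to_samples(corpus):
    """Turn tagged sentences into (features, tag) training samples."""
    samples = []
    for sentence in corpus:
        words = [w for w, _ in sentence]
        for i, (_, tag) in enumerate(sentence):
            samples.append((extract_features(words, i), tag))
    return samples


class FeatureVoteClassifier:
    """Toy base classifier: each feature votes for the tags it was seen with."""

    def fit(self, samples):
        self.counts = defaultdict(Counter)
        self.default = Counter(tag for _, tag in samples).most_common(1)[0][0]
        for feats, tag in samples:
            for f in feats:
                self.counts[f][tag] += 1
        return self

    def predict(self, feats):
        votes = Counter()
        for f in feats:
            votes.update(self.counts.get(f, {}))
        # Fall back to the most frequent training tag if no feature was seen.
        return votes.most_common(1)[0][0] if votes else self.default


def train_bagging(samples, n_estimators=11, seed=0):
    """Plain bagging: fit each base classifier on a bootstrap resample."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        ensemble.append(FeatureVoteClassifier().fit(bootstrap))
    return ensemble


def tag_sentence(ensemble, words):
    """Tag each token by majority vote over the ensemble's predictions."""
    tags = []
    for i in range(len(words)):
        feats = extract_features(words, i)
        votes = Counter(clf.predict(feats) for clf in ensemble)
        tags.append(votes.most_common(1)[0][0])
    return tags
```

A call such as `tag_sentence(train_bagging(to_samples(TOY_CORPUS)), ["the", "cat", "runs"])` returns one tag per token. An SBCB-style system would additionally choose which base classifiers to keep in the ensemble; that selection step is omitted here.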