iTagger: Part-of-Speech Tagging Based on SBCB Learning Algorithm

Part-of-speech (POS) tagging, or disambiguation, is a practical problem in the natural language processing (NLP) community, especially in the development of machine translation systems. The performance of a POS tagging system may interfere with the subsequent analytical tasks in the translation process and thereby affect the overall translation quality. This paper presents a novel POS tagging system, iTagger, which is based on the Selecting Base Classifiers on Bagging (SBCB) learning algorithm. In this work, the POS tagging task is treated as a classification problem. Features such as the surrounding context of ambiguous candidates, n-gram information, lexical items, and linguistic clues are extracted automatically from the annotated corpus. The proposed system is compared against two state-of-the-art tagging methods, the Hidden Markov Model (HMM) and Maximum Entropy. Empirical results on the English Brown corpus, the Portuguese Tycho Brahe corpus, and the Chinese Tree Bank corpus show that iTagger is competitive. Moreover, iTagger has been released to the public as a library and a tool for various development and application purposes.
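
To make the classification view concrete, the following minimal sketch shows how each token of an annotated sentence could be turned into a (features, tag) training instance using surrounding context, a simple bigram, and lexical clues. It is an illustration under stated assumptions, not the iTagger implementation: the function names, feature names, window size, padding symbol, and tag labels are all hypothetical, and the SBCB learner itself is not shown.

```python
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def extract_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Build a feature dictionary for the (non-empty) token at position i.

    The feature set is an assumption chosen for illustration: lexical item,
    suffix and capitalization clues, context words, and one bigram.
    """
    word = tokens[i]
    feats = {
        "word": word.lower(),                               # lexical item
        "suffix3": word[-3:].lower(),                       # linguistic clue
        "is_capitalized": str(word[0].isupper()),           # linguistic clue
        "has_digit": str(any(c.isdigit() for c in word)),   # linguistic clue
    }
    # Surrounding context: words within +/- window positions.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"w[{offset:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else PAD
    # n-gram information: bigram of the previous word and the current word.
    prev = tokens[i - 1].lower() if i > 0 else PAD
    feats["bigram_prev"] = prev + "_" + word.lower()
    return feats


def to_instances(tagged: List[Tuple[str, str]]) -> List[Tuple[Dict[str, str], str]]:
    """Convert one annotated sentence into (features, tag) classification instances."""
    tokens = [w for w, _ in tagged]
    return [(extract_features(tokens, i), tag) for i, (_, tag) in enumerate(tagged)]


# Illustrative sentence with made-up coarse tags.
sentence = [("The", "DET"), ("old", "ADJ"), ("man", "NOUN"), ("sails", "VERB")]
instances = to_instances(sentence)
```

Instances of this form can be passed to any multi-class learner, including an ensemble of base classifiers trained on bootstrap samples, which is the general setting in which bagging-based methods operate.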